GLM with logit link and Gaussian family to predict a continuous DV between 0 and 1

Question

Can you run a GLM using a logit link with a continuous DV (between 0 and 1)? Generally it's suggested to use a binomial family with a logit link, but I'm guessing that is because the model assumes a binary DV. If we have a continuous DV would we want to use a Gaussian family instead of binomial?

I apologize if this question doesn't make much sense: I have only a very basic knowledge of statistics, and am just trying to recalibrate a model specified by a colleague a number of years ago.

Related: http://stats.stackexchange.com/questions/26762 – amoeba Sep 04 '16 at 17:41

score 10 · Accepted Answer · edited Jan 04 '16 at 15:59

You seem to want to use a fractional logit, i.e. a quasi-likelihood model for a proportion. The key here is that it is a quasi-likelihood model, so the family refers to the variance function and nothing else. In quasi-likelihood that variance is a nuisance parameter, which does not have to be correctly specified in your model if your dataset is large enough. So I would stick with the usual family for a fractional logit model, and use the binomial family.
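A minimal sketch of what fitting a fractional logit involves, assuming a single predictor and using only the Python standard library. The names `expit` and `fit_fractional_logit` and the data are hypothetical, chosen for illustration. The point it tries to show is that the binomial quasi-score equations use only the mean model, so a response anywhere in [0, 1] is acceptable. Robust (sandwich) standard errors, which one would want in practice, are not computed here.

```python
import math


def expit(z):
    """Inverse logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_fractional_logit(x, y, n_iter=50, tol=1e-12):
    """Solve the binomial quasi-score equations sum((y - mu) * [1, x]) = 0
    for an intercept a and slope b, with mu = expit(a + b * x).
    y may be any value in [0, 1]; only the mean model is used."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        mu = [expit(a + b * xi) for xi in x]
        w = [m * (1.0 - m) for m in mu]
        # Score vector for (a, b).
        s0 = sum(yi - mi for yi, mi in zip(y, mu))
        s1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Information matrix X' W X, a 2 x 2 matrix.
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        # Newton step: solve (information) * delta = score.
        da = (i11 * s0 - i01 * s1) / det
        db = (i00 * s1 - i01 * s0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


# Hypothetical data: proportions that follow the logit mean model exactly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [expit(-0.5 + 0.8 * xi) for xi in x]
a_hat, b_hat = fit_fractional_logit(x, y)
```

Because the hypothetical `y` values lie exactly on the curve, the fit should recover the intercept and slope used to generate them. With noisy proportions the same equations apply, and only the standard errors need the robust correction.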

edited Jan 04 '16 at 15:59

Nick Cox

answered Jan 04 '16 at 15:35

Maarten Buis

+1. Note that with continuous proportions, just like binary (0, 1) variables, there is a variance-mean relationship which necessarily rules out a Gaussian. Consider limiting cases. A mean of 0 implies all values 0 and so variance 0; similarly a mean of 1 implies all values 1 and so variance also 0. Hence variance must be largest for some intermediate mean proportion and the binomial is more nearly right, at least qualitatively. As @Gavin Simpson rightly points out, a beta regression may also be defensible. – Nick Cox Jan 04 '16 at 16:04

Note that the argument in my comment above is a little hand-waving. For example, it's possible in principle that all values are 0.42 and so variance would then also be 0. But in practice such cases don't need or deserve modelling. – Nick Cox Jan 04 '16 at 16:30
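The limiting-case argument in these comments can be stated as an upper bound, which also accommodates the "all values are 0.42" case. As a sketch, assume only that $0 \le Y \le 1$ and write $\mu = E[Y]$. Then $Y^2 \le Y$, so

$$
\operatorname{Var}(Y) = E[Y^2] - \mu^2 \le E[Y] - \mu^2 = \mu(1 - \mu).
$$

The largest possible variance is therefore 0 at $\mu = 0$ or $\mu = 1$ and peaks at $1/4$ when $\mu = 1/2$. The bound is attained only by a binary variable, and a constant such as 0.42 sits below it with variance 0.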

score 8 · Answer 2 · edited Jan 04 '16 at 15:59

If your data really are continuous proportions (the common example I see is % silt, clay, or sand in sediment samples - only one of these types for beta regression, all three for a Dirichlet regression) then a beta regression would suggest itself. It's not a GLM sensu McCullagh and Nelder, but it is part of the extended family of GLMs that look, walk, and quack like a GLM.
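For reference, a small sketch of the likelihood a beta regression would maximise. It assumes the mean-precision parameterisation, in which $Y \sim \text{Beta}(\mu\phi, (1-\mu)\phi)$; the answer does not say which parameterisation is meant, and the function names are hypothetical. The sketch also shows the variance this model implies, $\mu(1-\mu)/(1+\phi)$.

```python
import math


def beta_logpdf_mean_precision(y, mu, phi):
    """Log density of Beta(mu * phi, (1 - mu) * phi) at y, for 0 < y < 1.
    mu is the mean and phi > 0 a precision parameter (an assumed
    parameterisation; the answer above does not specify one)."""
    a = mu * phi
    b = (1.0 - mu) * phi
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)


def beta_variance(mu, phi):
    """Variance implied by the same parameterisation."""
    return mu * (1.0 - mu) / (1.0 + phi)


# mu = 0.5, phi = 4 is Beta(2, 2), whose density is 6 * y * (1 - y).
logpdf_at_half = beta_logpdf_mean_precision(0.5, 0.5, 4.0)
var_at_half = beta_variance(0.5, 4.0)
```

In a regression, `mu` would be linked to the predictors (for example through a logit link) and `phi` estimated alongside the coefficients. Note that the log density is undefined at exactly 0 or 1, so data containing those values need separate handling.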

edited Jan 04 '16 at 15:59

Nick Cox

answered Jan 04 '16 at 15:47

Gavin Simpson

I (together with Nick) have worked with regression based on the beta and Dirichlet distributions, so I should be partial to them. However, I am slowly being convinced (based on numerous simulations) that a fractional (multinomial) logit tends to be more robust. The variance no longer has to be correctly specified in a fractional logit, while it has to be correctly specified in beta or Dirichlet regression. If it is the variance that is of substantive interest, then a fractional logit won't do what you want, but otherwise a fractional logit would be my default model for fractional data. – Maarten Buis Jan 04 '16 at 16:08
@MaartenBuis Indeed; I didn't intend this to be taken as an either/or - I've also used both quasi-binomial and beta regressions. – Gavin Simpson Jan 04 '16 at 16:51

Why is beta regression not a GLM sensu strictu, @Gavin? – amoeba Sep 04 '16 at 16:22

With all the parameters to be estimated, I didn't think you could write it down in the form required for GLMs sensu McCullagh & Nelder. This is in the same sense that a negative binomial model doesn't fit the GLM scheme if the theta parameter is to be estimated too. – Gavin Simpson Sep 04 '16 at 16:32

Ultra-pedantic belated comment. The Latin here is, or should be, _sensu stricto_. I am as fond of Latin as anyone else but saying _strict sense_ would lose nothing here and would be an exact translation. (In contrast, an expression such as _ad hoc_ has a distinctive flavour worth savouring.) – Nick Cox Sep 07 '20 at 08:19
@NickCox When I used *sensu* this is just a throwback to the days where I used to count remains of microorganisms using a microscope and it was common to say "*sensu* C. piger" as in "in the sense of C. piger". And anyway, "*sensu*" is shorter than "in the sense of" and sounds more highfalutin :P – Gavin Simpson Sep 07 '20 at 16:54
That's fine by me and indeed people say things like "species _sensu_ Mayr", which is defensibly concise and precise. My point was just that _strictu_ is wrong as qualifying _sensu_; there is no such word. See e.g. https://www.latin-is-simple.com/en/vocabulary/adjective/7980/ It's an ablative, IIRC, and as said _stricto_ is needed. – Nick Cox Sep 07 '20 at 17:02

score 7 · Answer 3 · edited Jan 04 '16 at 20:14

Yes, you can. The model parameters are still log-odds ratios, but they are estimated differently. A model with these specifications is basically nonlinear least squares, where a logit "S" curve is fit to the outcomes so as to minimize the squared error. However, the contrasts with the usual logistic regression are well known: this approach puts very little weight on outcomes near 0 or 1, whereas a difference such as 0.95 versus 0.96 is much larger once it is scaled by its binomial variance. The Gaussian family does not assume any mean-variance relationship. That's why this approach is not often used.
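One way to make the weighting contrast concrete is through the working weights used in iteratively reweighted least squares, $(d\mu/d\eta)^2 / V(\mu)$, where $d\mu/d\eta = \mu(1-\mu)$ for a logit link. The sketch below assumes that standard formula; the function name `irls_weight` is hypothetical. It compares how much weight each family gives an observation with fitted mean 0.95, relative to one with fitted mean 0.5.

```python
def irls_weight(mu, family):
    """Working weight (d mu / d eta)**2 / V(mu) for a logit link,
    where d mu / d eta = mu * (1 - mu)."""
    dmu_deta = mu * (1.0 - mu)
    if family == "gaussian":
        variance = 1.0  # constant variance function
    elif family == "binomial":
        variance = mu * (1.0 - mu)
    else:
        raise ValueError("family must be 'gaussian' or 'binomial'")
    return dmu_deta ** 2 / variance


# Weight at a fitted mean of 0.95 relative to the weight at 0.5.
rel_gaussian = irls_weight(0.95, "gaussian") / irls_weight(0.5, "gaussian")
rel_binomial = irls_weight(0.95, "binomial") / irls_weight(0.5, "binomial")
```

Under these assumptions the relative weight is about 0.036 for the Gaussian family and 0.19 for the binomial family, so the Gaussian fit discounts observations near the boundaries considerably more.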

If the results given to you are proportions, then the burning question is: do you have the denominators for these proportions? For example, is a proportion of 0.43 calculated out of $n=100$ or $n=200$ participants, and does this denominator differ between the various observations you've obtained? If so, weighting the binomial likelihood by the denominators gives inference equivalent to that from the fully observed 0/1 outcomes.

In R, for instance, it will still give you warnings that you have used non-binary outcome variables, but the fitting algorithm does not "break" when inputting data of this format. Other software may prevent such approaches altogether so you will have to create product variables.
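The equivalence between a weighted proportion and the underlying 0/1 records can be checked directly on the log-likelihood. This is a sketch with hypothetical function names and a made-up observation (0.43 out of 100). It compares the binomial log-likelihood kernel, which omits the combinatorial constant that does not depend on the probability, with the sum of Bernoulli terms over the expanded records.

```python
import math


def weighted_proportion_loglik(y, n, p):
    """Binomial log-likelihood kernel for an observed proportion y out of
    n trials, i.e. the Bernoulli term weighted by n."""
    return n * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def expanded_loglik(successes, failures, p):
    """Sum of Bernoulli log-likelihoods over individual 0/1 records."""
    outcomes = [1] * successes + [0] * failures
    return sum(math.log(p) if o == 1 else math.log(1.0 - p) for o in outcomes)


# Hypothetical observation: a proportion of 0.43 out of n = 100.
p = 0.4  # any candidate fitted probability
ll_weighted = weighted_proportion_loglik(0.43, 100, p)
ll_expanded = expanded_loglik(43, 57, p)
```

The two values agree for any `p` strictly between 0 and 1, which is why the estimates and standard errors match whether the data are supplied as weighted proportions or as individual 0/1 rows.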

However, without such counts in place, other robust error estimation methods should be used. Others' suggestions of quasi-likelihood seem like a reasonable choice.

+1. What if the data are *probabilities*? For example, data come from a psychological experiment where people were estimating probabilities of something; these predictions (between 0 and 1) are the DV. It's like logistic regression but instead of the binomial outcome we have probability itself. What is a reasonable approach then? – amoeba Sep 04 '16 at 16:21
@amoeba I think the approach is still valid, provided the mean model is correct. – AdamO Feb 20 '18 at 20:49
