Which link function for a regression when Y is continuous between 0 and 1?

Question

I've always used logistic regression when Y was categorical (0 or 1).

Now I have this dependent variable that is really a ratio/probability. That means it can be any number between 0 and 1.
I really think a "logistic" shape would fit very nicely, but I remember that Y being categorical was a big deal in the proof of why MLE works.

The point is: am I wrong to use logit regression for this Y, or does it not matter? Should I use probit instead?
Am I committing a capital crime?

Can you describe the actual data you have a little bit more? For example, with categorical predictors, the responses in logistic regression can be reinterpreted as binomial random variables with one variable per unique set of predictors. If your data are similar in some way (e.g., ratios of counts), then something similar may apply. If not, knowing more about the data may still point toward a direction to take. — cardinal, Oct 13 '11 at 00:07
Basically, my data refers to "group of observations". Example: Y_i is the proportion of smokers in group i. I have many such groups, all of the same size and each of them associated to a parameter X. — CarrKnight, Oct 13 '11 at 00:47
Ah, very good! In that context, your proportions are equivalent to counts and you know exactly how the proportions should be weighted (i.e., equally since there are an equal number of members in each group---this is what I was hinting at in my comments in the answers). This is a textbook scenario for logistic regression. :) — cardinal, Oct 13 '11 at 00:49
glad to know that everything worked out. And thank you for the comments, they were quite instructive! — CarrKnight, Oct 13 '11 at 00:56
Just to make @cardinal's point clear, you should *not* do logistic regression with Y as the dependent variable. (The likelihood makes no sense.) You need two variables: the total and the observed count (their ratio equals Y). Logistic regression is done using the observed count as the dependent variable and the total as a *frequency weight.* The two approaches are not equivalent: they will lead to different results, especially concerning standard errors of coefficients and p-values. — whuber, Oct 13 '11 at 03:18
@whuber: I think you have it the wrong way round. In most logistic regression packages that I know of, modelling Y (a proportion 0 <= Y <= 1) will work fine, with the totals as the frequency weights. In some packages, eg R, you can also use the counts of positive and negative responses (a matrix) as the response; in this case you leave out the frequency weights. — Hong Ooi, Oct 13 '11 at 07:47
@Hong Thanks for the clarification. I think we may be in agreement, except perhaps on the software details, which will vary by program (and even within programs according to their input options). The point is that passing a set of (x, proportion) values to a logistic regression package won't do what one expects; the package needs to get information that's equivalent to (x, total, proportion of total) and to use the total as frequency weights. — whuber, Oct 13 '11 at 17:12
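
To make the point in the comments above concrete, here is a minimal sketch in plain Python (no stats package). It assumes a single predictor and made-up data; `fit_grouped_logit`, `score_and_info` and the numbers are hypothetical, not anything from the thread. It fits the binomial likelihood by Newton-Raphson twice: once with counts and totals, and once with the bare proportions treated as if each group were a single trial.

```python
import math

def score_and_info(b0, b1, x, successes, totals):
    """Score vector and Fisher information for logit(p_i) = b0 + b1 * x_i."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, si, ni in zip(x, successes, totals):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        r = si - ni * p           # observed minus expected count
        w = ni * p * (1.0 - p)    # binomial variance of the count
        g0 += r
        g1 += r * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return (g0, g1), (h00, h01, h11)

def fit_grouped_logit(x, successes, totals, max_iter=100, tol=1e-10):
    """Newton-Raphson fit; returns (b0, b1, se_b0, se_b1)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        (g0, g1), (h00, h01, h11) = score_and_info(b0, b1, x, successes, totals)
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step = information^-1 * score
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard errors from the inverse information at the final estimate.
    _, (h00, h01, h11) = score_and_info(b0, b1, x, successes, totals)
    det = h00 * h11 - h01 * h01
    return b0, b1, math.sqrt(h11 / det), math.sqrt(h00 / det)

# Made-up example: five groups of 50 people each.
x = [0, 1, 2, 3, 4]
totals = [50] * 5
successes = [5, 10, 15, 25, 35]
proportions = [s / n for s, n in zip(successes, totals)]

fit_counts = fit_grouped_logit(x, successes, totals)       # counts + totals
fit_naive = fit_grouped_logit(x, proportions, [1] * 5)     # proportions, totals ignored

print("counts + totals :", fit_counts)
print("proportions only:", fit_naive)
```

With equal group sizes the two fits should give the same coefficients, but the standard errors from the proportions-only fit are inflated by a factor of sqrt(50), because the fit has no idea how many observations stand behind each proportion. Passing the proportions together with the totals as weights is the same computation as the first fit.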

score 5 · Accepted Answer · answered Oct 13 '11 at 00:21

There's nothing wrong per se with using "logistic regression" for this kind of data. You can think of it as an empirical adjustment that allows fitting a response with bounded support. It's better than the alternative (logit-transforming your response, then using ordinary linear regression) because the resulting predictions are asymptotically unbiased, the mean predicted value equals the observed mean response, and (probably most important) you don't have to worry about situations where Y equals 0 or 1. The arcsine transformation can handle Y = 0 or 1, but then your regression results aren't so easily interpretable in terms of log-odds ratios.
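
The "mean predicted value equals the observed mean" property can be read off the estimating equations. As a sketch, assume a logit link, an intercept in the model ($x_{i0} = 1$), and known prior weights $w_i$ such as the group totals; the symbols $w_i$, $x_{ij}$ and $\hat\mu_i$ are my notation, not part of the answer. The fitted means $\hat\mu_i$ then satisfy

$$
\sum_i w_i \,(y_i - \hat\mu_i)\, x_{ij} = 0 \quad \text{for every column } j,
\qquad\text{so for } j = 0:\quad \sum_i w_i\, y_i = \sum_i w_i\, \hat\mu_i .
$$

With equal weights this says that the average fitted value equals the average observed proportion. Logit-transforming Y and running ordinary least squares gives no such guarantee on the original scale.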

The main thing to look out for is that, as with any generalized linear model, you are implicitly assuming a particular relationship between $E(Y|X)$ and $\textrm{Var}(Y|X)$. You should check that this assumption holds, e.g., by looking at diagnostic plots of residuals.
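
One simple numeric check to go alongside residual plots is the Pearson dispersion statistic. The sketch below assumes a binomial-type variance, $\textrm{Var}(Y_i|X) = \phi\,\mu_i(1-\mu_i)/n_i$, and that you already have fitted means from some model; the function name `pearson_dispersion` and the numbers are hypothetical.

```python
def pearson_dispersion(y, mu, totals, n_params):
    """Estimate phi in Var(Y_i) = phi * mu_i * (1 - mu_i) / n_i.

    y        : observed proportions
    mu       : fitted mean proportions from the model
    totals   : group sizes n_i
    n_params : number of fitted regression coefficients
    """
    chi2 = sum(
        n * (yi - mi) ** 2 / (mi * (1.0 - mi))
        for yi, mi, n in zip(y, mu, totals)
    )
    return chi2 / (len(y) - n_params)

# Made-up illustration: three groups of 100, fitted mean 0.5 everywhere.
phi_hat = pearson_dispersion(
    y=[0.6, 0.4, 0.5], mu=[0.5, 0.5, 0.5], totals=[100, 100, 100], n_params=1
)
print(phi_hat)
```

A value near 1 is consistent with the binomial variance assumption; a value well above 1 suggests overdispersion, in which case the model-based standard errors will be too small.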

In most cases, a probit regression will give very similar results to a logistic regression. An alternative is the complementary log-log link, if you have reason to believe there is asymmetry between Y = 0 and Y = 1.
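
To see how close the links are, here is a small sketch of the three inverse link functions using only the standard library. The function names are mine, and the rescaling constant 1.702 is a commonly quoted approximation for matching the logistic and normal curves, not something from the answer.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

grid = [i / 10.0 for i in range(-60, 61)]

# Largest gap between the logistic curve and a rescaled probit curve.
max_gap = max(abs(inv_logit(e) - inv_probit(e / 1.702)) for e in grid)
print("max |logit - rescaled probit| on the grid:", max_gap)

# Logit and probit are symmetric about 0.5; cloglog is not.
for name, f in [("logit", inv_logit), ("probit", inv_probit), ("cloglog", inv_cloglog)]:
    print(name, f(-1.3) + f(1.3))
```

After rescaling, the logit and probit curves differ by less than 0.01 everywhere on this grid, which is why the two fits usually look alike up to a change of scale in the coefficients. The complementary log-log curve approaches 0 and 1 at different rates, so the sum printed for it is not 1.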

It seems to me that without further information, any quasi-likelihood that was constructed could result in biased estimates of (some subset of) the parameters, even asymptotically. — cardinal, Oct 13 '11 at 00:44

score 1 · Answer 2 · answered Oct 13 '11 at 00:10

Link functions convert the expected value of Y (given X) to something that is unbounded. In logistic regression Y takes the values 0 or 1, but the logit isn't applied to Y; it is applied to Pr(Y=1|X). (The logit of 0 and the logit of 1 are both undefined.) So it's perfectly reasonable to use the logit or the probit in this case.

The other thing to think about is the residual variance: is there a particular transformation that would best stabilize the variance for your case? For proportions, the arcsine square-root transformation is often used, as it is variance-stabilizing for binomial proportions. Consider the discussion here.
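
For reference, a delta-method sketch of why the arcsine square-root transformation stabilizes the variance. It assumes $\hat p$ is a binomial proportion based on $n$ trials with true probability $p$; the notation is mine. With $g(p) = \arcsin\sqrt{p}$,

$$
g'(p) = \frac{1}{2\sqrt{p(1-p)}},
\qquad
\textrm{Var}\big(g(\hat p)\big) \approx g'(p)^2 \,\frac{p(1-p)}{n} = \frac{1}{4n},
$$

which, to first order, does not depend on $p$. The approximation is weakest when $p$ is close to 0 or 1 or when $n$ is small.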

Karl

While it is true that the logit is applied to the mean, it is not clear (to me, at the moment) what the likelihood would (sensibly) be in this context without some further information or assumptions. – cardinal Oct 13 '11 at 00:41
@cardinal - Until we got the additional information that this was really just a logistic regression situation, I was thinking that he wanted to do standard regression but with a transformation of the outcome, in which case the model would be $f(y) = X\beta + \epsilon$, where $f$ = chosen link (say the logit) and $\epsilon \sim \text{N}(0, \sigma^2)$. – Karl Oct 13 '11 at 03:47
