56

I am thinking of building a model predicting a ratio $a/b$, where $a \le b$ and $a > 0$ and $b > 0$. So, the ratio would be between $0$ and $1$.

I could use linear regression, although it doesn't naturally limit to 0..1. I have no reason to believe the relationship is linear, but of course it is often used anyway, as a simple first model.

I could use a logistic regression, although it is normally used to predict the probability of a two-state outcome, not to predict a continuous value from the range 0..1.

Knowing nothing more, would you use linear regression, logistic regression, or hidden option c?

amoeba
dfrankow
  • 5
    Have you considered beta regression? – Peter Flom May 23 '12 at 22:45
  • Many thanks to all who answered. I will have to study up and choose. Sounds like a beta is a decent place to start, especially if I can observe a good fit (perhaps by eye). – dfrankow May 24 '12 at 03:47
  • I've seen this done using a GLM (Poisson family with a log link). The numerator **a** would be the count data (the outcome) and the log of the denominator **b** would be the offset variable. You would then need separate **a** and **b** values for each subject/observation. I'm just not sure if this is the most valid option. I find the Beta distribution an interesting option, one that I had not heard of. However, I find it difficult to grasp, being a non-statistician. – MegPophealth Apr 04 '14 at 18:23
  • Thank you all for your deep and useful analysis. I am currently facing almost the same challenge, but instead of predicting a continuous ratio between 0 and 1, I want to build a regression model to predict patients' utility, which ranges between -1 and 1. This is quite tricky: I couldn't find any link function appropriate for a regression model with a continuous dependent variable ranging between -1 and 1. So I just want to have a clue about what could be done. Thanks, –  Aug 22 '14 at 08:31

4 Answers

41

You should choose "hidden option c", where c is beta regression. This is a type of regression model that is appropriate when the response variable is distributed as Beta. You can think of it as analogous to a generalized linear model. It's exactly what you are looking for. There is an R package called betareg that fits these models. I don't know if you use R, but even if you don't, the package 'vignettes' are worth reading: they give general information about the topic in addition to showing how to implement it in R (which you wouldn't need in that case).


Edit (much later): Let me make a quick clarification. I interpret the question as being about the ratio of two positive, real values. If so (and if $a$ and $b - a$ are independent Gammas with a common scale, so that $a/b$ is a Gamma divided by the sum of itself and another Gamma), the ratio follows a Beta distribution. However, if $a$ is a count of 'successes' out of a known total, $b$, of 'trials', then this would be a count proportion $a/b$, not a continuous proportion, and you should use a binomial GLM (e.g., logistic regression). For how to do it in R, see e.g. How to do logistic regression in R when outcome is fractional (a ratio of two counts)?
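
The link between Gammas and the Beta can be checked by simulation. The sketch below is only an illustration under stated assumptions: $a \sim \text{Gamma}(\alpha, 1)$, $b = a + c$ with an independent $c \sim \text{Gamma}(\beta, 1)$, and the shape values 2 and 5 are arbitrary choices, not anything taken from the question. It compares the simulated mean and variance of $a/b$ with the theoretical Beta$(\alpha, \beta)$ moments.

```python
import random
import statistics

def simulate_ratio(alpha, beta, n, seed=0):
    """Draw n values of a / b, where a ~ Gamma(alpha, 1) and b = a + c
    with an independent c ~ Gamma(beta, 1)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n):
        a = rng.gammavariate(alpha, 1.0)  # shape alpha, scale 1
        c = rng.gammavariate(beta, 1.0)   # shape beta, scale 1
        ratios.append(a / (a + c))
    return ratios

# Arbitrary illustrative shape parameters (assumptions, not from the question)
alpha, beta = 2.0, 5.0
ratios = simulate_ratio(alpha, beta, n=50_000)

# Theoretical moments of a Beta(alpha, beta) distribution
beta_mean = alpha / (alpha + beta)
beta_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

print(f"simulated mean {statistics.fmean(ratios):.4f} vs Beta mean {beta_mean:.4f}")
print(f"simulated var  {statistics.pvariance(ratios):.4f} vs Beta var  {beta_var:.4f}")
```

Matching two moments does not prove the distributions are identical, but it is a quick sanity check that the ratio behaves like a Beta under these assumptions.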

Another possibility is to use linear regression if the ratios can be transformed so as to meet the assumptions of a standard linear model, although I would not be optimistic about that actually working.
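
As a rough illustration of the transformation idea, the sketch below assumes a logit transform of the ratio followed by ordinary least squares on a single predictor. The choice of the logit, the data values, and the helper names are all assumptions made for the example; the answer does not name a specific transformation.

```python
import math

def logit(y):
    """Map a ratio in (0, 1) to the whole real line."""
    return math.log(y / (1.0 - y))

def inv_logit(z):
    """Map a real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_line(xs, zs):
    """Ordinary least squares for z = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    z_bar = sum(zs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxz = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs))
    slope = sxz / sxx
    return z_bar - slope * x_bar, slope

# Hypothetical data: ratios strictly inside (0, 1)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ratios = [0.10, 0.22, 0.45, 0.70, 0.88]

intercept, slope = fit_line(xs, [logit(r) for r in ratios])

# Back-transformed predictions stay inside (0, 1), even when extrapolating
preds = [inv_logit(intercept + slope * x) for x in (-10.0, 2.5, 10.0)]
print(preds)
```

The back-transformed value is not in general the conditional mean of the ratio, and the transform is undefined for ratios of exactly 0 or 1, which is part of why this route may not work well in practice.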

gung - Reinstate Monica
  • 2
    Would you mind elaborating on why beta regressions would be preferable in this case? That's a recommendation I see fairly often here, but I don't really see anyone elaborating on the rationale - that would be nice to have! – Matt Parker May 23 '12 at 23:02
  • (Though I'm finding the intro to [the betareg vignette](http://cran.r-project.org/web/packages/betareg/vignettes/betareg.pdf) quite informative) – Matt Parker May 23 '12 at 23:04
  • 6
    @MattParker, Beta is the distribution of continuous proportions--if that's what you have as your response variable, then Beta is the appropriate distribution to use. It's really that simple. The fitted value from a logistic regression is a probability (which is obviously continuous), but the distribution is binomial (some number of Bernoulli trials w/ success probability $p$). If your response variable is not a set of Bernoulli trials, then LR is not appropriate. – gung - Reinstate Monica May 23 '12 at 23:21
  • 4
    I would be careful saying that a beta is "the" appropriate distribution to use. It's fairly flexible and it might be appropriate but it doesn't cover all cases. So while it's a good suggestion and may very well be what they want - you can't really say that it's the appropriate distribution solely on the fact that it's a continuous response between 0 and 1. – Dason May 23 '12 at 23:28
  • 2
    A triangular distribution on [0,1] represents a continuous distribution on proportions that is not a beta. There could be many others. The beta is a nice flexible family but there is nothing magic about it. You do make a good point about logistic regression because it is usually applied to binary data. – Michael R. Chernick May 24 '12 at 02:07
  • 2
    Perhaps I should try to seem less dogmatic. What I meant is that you examine your DV & use the distribution it follows. True, there are other distributions of continuous proportions. Technically, Beta is the ratio of a Gamma over the sum of it + another Gamma. In a given situation, a different distribution *could* be superior; eg Beta cannot take the values 0 or 1, only (0, 1). Nonetheless, Beta is well understood and very flexible with just 2 parameters to fit. I argue that when dealing w/ a DV that is a continuous proportion it is typically the best place to start. – gung - Reinstate Monica May 24 '12 at 02:27
  • One question (maybe I should submit it separately): are the coefficients of the beta fit interpretable? One thing I like about logistic (which I'm not planning to use) is its interpretability: each coefficient is the log odds of the particular factor's influence on the result. – dfrankow May 25 '12 at 21:22
  • It's in the vignette (although it may not be obvious). Beta regression is like any old GLM: the betas are coefficients relating covariates to an estimated parameter of the conditional distribution of the response via a link function. In this case, the response distribution is Beta, and the estimated parameter is $\hat\mu$, the mean. The key for how to interpret the betas is to understand the link function you use; the logit is most common but other possibilities exist. If you use the logit, the betas would be interpreted just as w/ logistic regression (ie, as changes in log odds). – gung - Reinstate Monica May 26 '12 at 03:18
  • 1
    What do I do when my data is a proportion between 0 and 1 but can also be exactly 0 or 1? Using the betareg function I get the error "all observations must be in (0, 1)" while the range of my response is 0.000 - 0.935. – crazjo Aug 19 '14 at 11:36
  • @JolJols, Beta is only supported on (0,1); it cannot = 0 exactly. If your data have real 0's, they aren't distributed as Beta (although it may be a good enough approximation). I wonder if you had really small values that were rounded to 0, though. You could try adding a very small value to your data, or re-scaling them. You might find [this thread](http://stats.stackexchange.com/q/30728/7290) helpful, or you could ask a new question. – gung - Reinstate Monica Aug 19 '14 at 13:17
  • @gung This won't work for data that have real 0 or 1 values. Need something like a truncated normal glm between 0 and 1 with a logit link function. – colin Mar 02 '16 at 19:57
  • There is a bunch of similar threads on CV, and the advice that is usually given is that if the DV is a count proportion then one should use binomial GLM; only if the DV is a continuous proportion or probability then one should use beta GLM. See e.g. today's answer by Glen_b: http://stats.stackexchange.com/a/239395/28666 or this popular thread http://stats.stackexchange.com/questions/26762 and Greg's answer. Note that OP here asked about count proportions and your answer is currently in conflict with these other answers. Would you perhaps want to make an update with discussion / further links? – amoeba Oct 10 '16 at 14:00
  • @amoeba, I don't see where in the question you are seeing that this is a count proportion. It seems to say that the response is "a continuous value from the range 0..1", & not "a two-state outcome". Certainly if it were a count proportion / a two-state outcome, then logistic regression would be appropriate. – gung - Reinstate Monica Oct 10 '16 at 14:54
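
The rescaling suggestion in the comments above can be sketched as follows. This uses the transformation $(y(n-1) + 0.5)/n$, with $n$ the number of observations, which to my knowledge is the one discussed in the betareg vignette; treat that attribution, the function name, and the data as assumptions of this example.

```python
def squeeze(ys):
    """Map values in [0, 1] into the open interval (0, 1).

    Applies (y * (n - 1) + 0.5) / n, where n is the number of observations.
    """
    n = len(ys)
    return [(y * (n - 1) + 0.5) / n for y in ys]

# Hypothetical responses that include an exact 0 and an exact 1
ys = [0.0, 0.25, 0.935, 1.0]
squeezed = squeeze(ys)
print(squeezed)
```

The shift shrinks as $n$ grows, so with a realistic sample size the data move very little. It does not change the point raised in the comments that data with genuine 0s or 1s are not Beta distributed, so this is a pragmatic workaround rather than a fix.
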
2

Are these paired samples or two independent populations?

If independent populations, you might consider $\log(M_i) = \log(B) + X_i \log(\text{ratio})$. Here $M$ is your measurement (a vector containing all values of A and B) and $X$ is an indicator vector: $X_i = 1$ if $M_i$ is a value of A, and $X_i = 0$ if $M_i$ is a value of B.

Your intercept of this regression will be log(B) and your slope will be log(ratio).
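
A minimal sketch of this regression in plain Python, assuming ordinary least squares on the log scale; the function name and the measurements are hypothetical. With a single 0/1 indicator, the least-squares slope reduces to the difference in mean logs, so the estimated ratio is a ratio of geometric means.

```python
import math

def log_ratio_regression(a_values, b_values):
    """Regress log(M) on an indicator X (1 for A, 0 for B) by least squares.

    Returns (intercept, slope); exp(slope) estimates the ratio A/B.
    """
    m = [math.log(v) for v in a_values + b_values]
    x = [1.0] * len(a_values) + [0.0] * len(b_values)
    n = len(m)
    x_bar = sum(x) / n
    m_bar = sum(m) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxm = sum((xi - x_bar) * (mi - m_bar) for xi, mi in zip(x, m))
    slope = sxm / sxx
    intercept = m_bar - slope * x_bar
    return intercept, slope

# Hypothetical measurements from two independent groups
a = [2.1, 1.9, 2.4, 2.0]
b = [4.0, 4.3, 3.8, 4.1, 3.9]

intercept, slope = log_ratio_regression(a, b)
print("estimated B (geometric mean):", math.exp(intercept))
print("estimated ratio A/B:", math.exp(slope))
```

This only gives point estimates; confidence intervals for the ratio would need the standard error of the slope, which is what the methods in the reference below compare.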

See more here:

Beyene J, Moineddin R. Methods for confidence interval estimation of a ratio parameter with application to location quotients. BMC medical research methodology. 2005;5(1):32.

EDIT: I have written an SPSS addon to do just this. I can share it if you're interested.

Ferdi
DocBuckets
  • 1
    Out of curiosity which method did you use (delta, Fieller or GLM)? It slays me a bit that the BMC article did not do some simulations of the coverage of the different estimators (although to dream up a realistic simulation would be annoying). I was reminded because I recently [came across a paper that does the delta method](http://dx.doi.org/10.1177/0042098012466601) (with no real justification), although it does cite the BMC article. – Andy W Jun 11 '13 at 19:53
  • 1
    Back when I wrote this comment, I used `REGRESSION` after log-transforming the data. Since then I've written a more sophisticated version that uses `GLM`. I deal with light emission measurements and my testing suggested gamma regression with a log-link was the least prone to runaway uncertainty on the parameters. For most of my real data, the answers from using normal, negative-binomial, and gamma with log-link were all really similar (at least to the precision I needed) – DocBuckets Jun 11 '13 at 21:52
0

Not true. The data for logistic regression are binary (0 or 1), but the model predicts $p$, the probability of success given the predictors $X_i$, $i=1,2,\dots,k$, where $k$ is the number of predictor variables in the model. Actually, because of the logit function, the linear model predicts the value of $\log\left(\frac{p}{1-p}\right)$. So to get the prediction for $p$ you just apply the inverse transformation $p=\frac{\exp(x)}{1+\exp(x)}$, where $x$ is the predicted logit.
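
The inverse transformation can be written as a small helper. This is a sketch with hypothetical function names; the branch on the sign of $x$ is an implementation choice to avoid overflow in `math.exp`, and both branches compute the same quantity $\exp(x)/(1+\exp(x))$.

```python
import math

def inv_logit(x):
    """Convert a predicted logit x into p = exp(x) / (1 + exp(x))."""
    if x >= 0:
        # Equivalent form 1 / (1 + exp(-x)); exp(-x) cannot overflow here
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so exp(x) is at most 1
    return e / (1.0 + e)

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

for x in (-3.0, 0.0, 3.0):
    print(x, inv_logit(x))
```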

Michael R. Chernick
  • -1. I don't see how this answers the question (and in addition $p$ is used to refer to two different things in this answer). – amoeba Feb 15 '17 at 09:11
  • 2
    -1. I agree with @amoeba. I am puzzled why this was ever upvoted. It does not bear on the question, which doesn't assume binary data 0 or 1 at all but is focused on measured proportions which are between 0 and 1 inclusive. – Nick Cox Feb 15 '17 at 10:20
0

We can use sample weights in SVM-C, or in any other classifier that accepts them, with the weights derived from the ratio. Each data point in the original data becomes two data points:

  1. With 1 as target variable where sample_weight is equal to ratio
  2. With 0 as target variable where sample_weight is (1-ratio).

Consider the effect of sample_weights here on SVM-C https://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html

We can then use the predicted probability for target value 1 as our ratio estimate. Here the weights refer to the `sample_weight` argument of the fit method in sklearn.
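
The duplication step can be sketched in plain Python. The function name and the data are hypothetical; only the construction of the expanded rows is shown, and the final comment indicates how they might be handed to a scikit-learn classifier whose `fit` accepts `sample_weight`.

```python
def expand_with_weights(features, ratios):
    """Duplicate each observation into a (target=1, weight=ratio) row
    and a (target=0, weight=1 - ratio) row."""
    X, y, w = [], [], []
    for x, r in zip(features, ratios):
        X.extend([x, x])
        y.extend([1, 0])
        w.extend([r, 1.0 - r])
    return X, y, w

# Hypothetical data: one feature per observation and an observed ratio
features = [[0.2], [1.5], [3.1]]
ratios = [0.10, 0.40, 0.85]

X, y, w = expand_with_weights(features, ratios)

# Sanity check: the weighted share of 1-targets equals the mean ratio
weighted_share = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
print(weighted_share, sum(ratios) / len(ratios))

# The expanded rows could then be passed to a classifier that accepts
# per-sample weights, e.g. clf.fit(X, y, sample_weight=w) in scikit-learn.
```

Rows with a weight of exactly 0 (ratios of 0 or 1) contribute nothing to the fit, so they could be dropped before training.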

  • 3
    It is difficult to follow what you mean by "logistic regression with weights," because you haven't specified how you intend a logistic regression to be weighted. – whuber Feb 26 '20 at 16:44
  • Here I refer weights as 'sample_weights' which is how we weight each sample in our observation https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit – Imroz Khan Feb 27 '20 at 16:14
  • 2
    There is nothing like a sample weight apparent in the setting of the question. It's unclear what you mean by "proportion percentage"--your language is vague. – whuber Feb 27 '20 at 16:28