Appropriate regression model when dependent variable is between 0 and 1?

Question

I am performing a regression in which the dependent variable is a group's Simpson Diversity Index. The index is bounded between $1/k$ and $1$ (where $k$ is the number of classes), though none of my values approach $1$. I know OLS regression is not well suited to a bounded dependent variable, and my research on the appropriate method has pointed me in several directions, including a logit transformation and beta regression. Beta regression is well over my head, so I am considering the logit transformation, but I am still looking for advice on interpreting the resulting coefficients and on whether this method is truly sufficient.

Additionally, some other questions: Do I just transform the dependent variable and leave the independent variables alone? Do I transform both? (By the way, my independent variables include percentages, integers, and dummy variables.)

With the transformation, I have read that OLS would then be appropriate, but I have also seen suggestions for GLM.

Hi user27557, welcome to the site! Logistic regression is one possibility (that is a GLM with a logit-link). You'll find [many posts](http://stats.stackexchange.com/search?q=logistic+regression+interpretation) on how to interpret the output of a logistic regression. In addition, see [here](http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm), [here](http://www.ats.ucla.edu/stat/spss/output/logistic.htm) and [here](http://www.ats.ucla.edu/stat/stata/library/sg124.pdf). — COOLSerdash, Jul 02 '13 at 17:10
Whether you need to transform your predictors is unpredictable and is not a logical consequence of how you handle the response (independent and dependent variables in your terminology). Just as with classical linear regression, any transformation choice should depend on the relationships between variables. — Nick Cox, Jul 02 '13 at 17:24
Thanks for the responses COOLSerdash and Nick. I have seen some of the logistic regression posts. While the interpretation of the logistic regression makes sense to me in the context of a binary response, I have not read anything that adequately addresses the use of the method and its interpretation when the dependent variable is already a form of probability (before any transformation to odds ratio). Do you know of any good links to this? Perhaps it's in the Stata bulletin you posted, and I'm just misinterpreting the terms. — Carter, Jul 02 '13 at 17:41
There is nothing mysterious about this. The logistic curve is a classic (if highly simplified) model for population growth in what I guess is your own discipline, ecology (Verhulst, Lotka, Pearl, etc.). So continuous logistic (logit) models long predate Berksonian logit models for binary responses. Extending that to several predictors makes it trickier to visualize, but all that is central is that predictions must be bounded if the response is (and it is, as a proportion). – Nick Cox, Jul 02 '13 at 17:58
Also, beta regression is not so difficult. It's the same idea: distributions for the response must be bounded if the response is. — Nick Cox, Jul 02 '13 at 18:01
Take a look at http://stats.stackexchange.com/questions/49443/how-to-model-this-odd-shaped-distribution-almost-a-reverse-j — rolando2, Jul 02 '13 at 22:30
If the dependent variable is a proportion, then the exponentiated coefficients in a fractional logit or a beta regression are related to, but not the same as, odds ratios. Nick and I have called this a "relative proportion ratio" in our `betafit` program, and there is a discussion of that interpretation in the help file: http://repec.org/bocode/b/betafit.html – Maarten Buis, Jul 03 '13 at 09:00
So I can interpret the coefficients as described in the 'relative proportion ratio' section of the above link, even given that my dependent variable is not a relative proportion/odds ratio? — Carter, Jul 09 '13 at 15:45
For what it's worth, here are the descriptive statistics of my data: sample size 136; range 0.15222; mean 0.12035; variance 0.00119; std. deviation 0.03447; coef. of variation 0.28642; std. error 0.00296; skewness 1.1795; excess kurtosis 0.89617; min 0.0746; max 0.22682. – Carter, Jul 09 '13 at 15:50

score 3 · Answer 1 · answered Jul 02 '13 at 17:43

Do you have any values of the response that are exactly 0 or 1? (those will cause problems with a logit transform)
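To make the issue with exact 0s and 1s concrete, here is a minimal Python sketch of the logit transform and its inverse. The helper names `logit` and `inv_logit` are illustrative assumptions, not taken from any particular package, and the three test values are simply the minimum, mean, and maximum reported in the question's comments.

```python
import math

def logit(p):
    """Log-odds of p; only defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is undefined at or outside 0 and 1")
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Map any real number back into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Values in the range reported in the question transform without trouble.
for p in (0.0746, 0.12035, 0.22682):
    z = logit(p)
    print(f"p={p:.5f}  logit={z:.4f}  back={inv_logit(z):.5f}")

# Exact 0 or 1 cannot be transformed.
for p in (0.0, 1.0):
    try:
        logit(p)
    except ValueError as err:
        print(f"p={p}: {err}")
```

Because the inverse maps every real number into (0, 1), predictions made on the logit scale and then back-transformed can never leave the unit interval.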

Have you tried plotting your data? What exploratory techniques have you used? What have other researchers in the area done?

You could try simulating some data that fits with a logit transform or a beta regression model (or anything else that you consider trying) and see how that compares to your data to get a better feel for which model may be more appropriate.
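One way to act on this suggestion is sketched below in plain Python. It simulates a response that is linear on the logit scale with normal noise, fits ordinary least squares to the logit-transformed response, and back-transforms the fitted values. Everything specific here is an assumption made for illustration: the single predictor, the coefficients `true_a` and `true_b`, the noise level, and the hand-written `ols_fit` helper. Only the sample size of 136 comes from the question. A beta regression would need a different simulation and fitting routine and is not shown.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def ols_fit(x, y):
    """Intercept and slope of a simple least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(0)        # fixed seed so the run is repeatable
true_a, true_b = -2.0, 0.5    # assumed coefficients on the logit scale
x = [rng.uniform(-1.0, 1.0) for _ in range(136)]  # hypothetical predictor
y = [inv_logit(true_a + true_b * xi + rng.gauss(0.0, 0.2)) for xi in x]

# Fit on the transformed response, then back-transform the fitted values.
a_hat, b_hat = ols_fit(x, [logit(yi) for yi in y])
fitted = [inv_logit(a_hat + b_hat * xi) for xi in x]

print(f"estimated intercept {a_hat:.3f}, slope {b_hat:.3f}")
print(f"simulated y range: {min(y):.3f} to {max(y):.3f}")
print(f"fitted range:      {min(fitted):.3f} to {max(fitted):.3f}")
```

Plotting the simulated `y` next to the real response (histograms, or scatterplots against each predictor) gives a rough visual check of whether the assumed model produces data that look like yours.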

With what you have given us, we can only make suggestions; you need to decide what makes the most sense based on your understanding of the data, the science behind it, and the questions you are trying to ask. You may also need to consult an expert in the area and/or a professional statistician. Choosing not to do a beta regression because it is beyond you is like having your doctor say that you may need brain surgery, but that he is going to take out your appendix instead, because brains are beyond his experience and he is good with appendixes.

answered Jul 02 '13 at 17:43

Greg Snow


Exact 0s and 1s don't rule out a GLM approach. Having lots of 0s and/or 1s would however raise the question of whether you need a different model. However, OP did say no 0s, no 1s. – Nick Cox Jul 02 '13 at 18:00
Thanks. To clarify, I only aim to avoid beta regression if other methods are equally satisfactory given the circumstances. If it is necessary, then it should be done. No values are 0 or 1. Minimum value is 1/17 (0.0746), max value is 0.2268. (That is min and max of the dependent variable...the Simpson.) – Carter Jul 02 '13 at 18:09
This is sort of an interesting issue to me: isn't the sort of regression you choose motivated by either the conditional distribution of y or the loss involved in predicting the wrong value? So if you do a beta regression, are you estimating both the conditional distribution and the regression parameters simultaneously? – alex Jul 02 '13 at 18:59

score 0 · Answer 2 · edited Jul 03 '13 at 08:24

You know $k$, but you used $1/k$ as the dependent variable. Do not divide; use the values of $k$ as the dependent variable. As you say, $k$ is the number of classes, so you should consider regression with a categorical dependent variable. For reference you should look here

and I think you should avoid $1/k$ whatever regression or method you use, because with more classes the result gets close to zero and with few classes it gets close to 1, which yields misleading results for the independent variables.

edited Jul 03 '13 at 08:24

Nick Cox


answered Jul 03 '13 at 08:17

SAAN


No. This misses the point with this response variable. Simpson's index ranges from $1$ when only one kind is present to $1/k$ when $k$ kinds are equally common, but $k$ is not equivalent information. It can make much sense to model $k$ directly (e.g. number of species seen in ecology), but that's not the same problem at all. – Nick Cox Jul 03 '13 at 08:20
Yes, Nick is correct. The dependent variable is not 1/k...1/k is the minimum possible Simpson value. A group has a Simpson value of 1/k when its population is equally distributed across k classes, as Nick states. A group has a Simpson value of unity when its population is concentrated in a single class. As diversity of the group increases, the Simpson value decreases. Thus, my dependent variable is not strictly a proportion or percentage, but one might interpret it as the probability of any two randomly selected group members being in the same class. – Carter Jul 03 '13 at 13:37
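The bounds described in these comments can be checked with a short Python sketch. It assumes the index is computed as the sum of squared class shares, which is the form that ranges exactly from $1/k$ to $1$ and corresponds to drawing two members with replacement. The function name `simpson_index` and the toy groups are illustrative only.

```python
from collections import Counter

def simpson_index(labels):
    """Sum of squared class shares: the chance that two members drawn
    with replacement fall in the same class."""
    counts = Counter(labels)
    n = sum(counts.values())
    return sum((c / n) ** 2 for c in counts.values())

even = ["a", "b", "c", "d"] * 5    # k = 4 classes, equally common
single = ["a"] * 20                # whole group in one class
skewed = ["a"] * 17 + ["b", "c", "d"]

for name, group in (("even", even), ("skewed", skewed), ("single", single)):
    print(f"{name:7s} {simpson_index(group):.4f}")
```

The evenly spread group sits at the lower bound $1/k$, the single-class group at $1$, and anything in between falls strictly inside that range, so a lower value means a more diverse group.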
