6

Although I saw a few similar threads, I don't believe I saw the specific answer to the following question:

For simple linear or multiple linear regression, if your dependent variable is a percentage, are any assumptions violated? I know that Y should be continuous, but does it also technically have to be unbounded? I've never seen this listed as one of the assumptions, though I understand how a bounded dependent variable can cause specific issues.

In my case, I'm doing a multiple regression project for school where the dependent variable is percentage of obese schoolchildren. Should I do a logit transformation or beta-regression because Y is bounded?


In response to a comment, the kernel density plot for Y (pct_obese) is below. There does not seem to be bunching at the boundaries; rather, the bulk of the data hovers around 20%.

[Figure: kernel density plot of pct_obese]

gung - Reinstate Monica
sjc725
  • You're right that a bounded response may require different treatment. Do the data ever bunch up near the boundary? If so, this becomes especially important. – eric_kernfeld Jun 12 '17 at 02:13
  • Your fitted model could predict more than 100 percent/less than 0 percent. Can this be an issue? – rep_ho Jun 12 '17 at 19:52

2 Answers

10

You should not use linear regression here, nor should you transform your data with the logit transformation. You have a percentage variable in a sense, but that's just a way to display your data in a simplified manner. In another sense, you have a count of obese children out of a known total of kids. That is, you have binomial data.

Thus, you should use logistic regression, fit to the actual counts of children. Exactly how that is done depends on how your software implements it; for a discussion of SAS and R, see: Difference in output between SAS's proc genmod and R's glm. People often think of logistic regression as the option to use when your response is 0/1, but it is actually applicable to any binomial response, even when each observation comprises more than one Bernoulli trial.
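
As a minimal sketch of this in R, assuming hypothetical columns `n_obese` (number of obese children) and `n_total` (number of children measured) per school, plus an illustrative predictor `median_income` that is not from the original question:

```r
# Hypothetical data: one row per school; the column names are illustrative only.
dat <- data.frame(
  n_obese       = c(12, 30, 25, 8),
  n_total       = c(100, 120, 90, 60),
  median_income = c(55, 40, 48, 70)
)

# Binomial GLM on the counts: the response is cbind(successes, failures),
# not the percentage itself.
fit <- glm(cbind(n_obese, n_total - n_obese) ~ median_income,
           data = dat, family = binomial(link = "logit"))
summary(fit)
```

The fitted coefficients are then on the log-odds scale of the probability that a given child is obese.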

gung - Reinstate Monica
  • +1 @gung. It seems many posters have this same problem and don't understand that a percentage is simply a linear transformation or scaling of a proportion of the number of successes in $n$ events. Very nice, succinct answer. – StatsStudent Jun 19 '17 at 19:58
  • I am concerned that this answer might be misunderstood. As a clarification, I would suggest supplementing "use logistic regression" by "based on the actual counts, not the percentages!" – whuber Apr 08 '21 at 19:35
  • So let's say my response is a bunch of proportions, say 0.2, 0.3, 0.1, etc... I should convert that to counts so turn the row with 0.1 into 10 rows where 9 of them are 0 and 1 of them is 1? I am not understanding how I would convert my response to counts. – confused Apr 08 '21 at 20:23
  • I did find something that talks about the logit transformation - for other people's reference: https://stackoverflow.com/questions/44234682/how-to-use-sklearn-when-target-variable-is-a-proportion – confused Apr 08 '21 at 20:42
  • @confused, do you have the counts from which those proportions were calculated? – gung - Reinstate Monica Apr 08 '21 at 21:31
  • It's not really a count unfortunately. It's a percentage of a whole - if that makes any sense. I ended up just doing the logit transformation and running a regular linear regression. – confused Apr 08 '21 at 23:14
  • @confused, if it's not a count, then you shouldn't use logistic regression--that's the point of this answer. See: [Regression for an outcome (ratio or fraction) between 0 and 1](https://stats.stackexchange.com/a/29042/7290). – gung - Reinstate Monica Apr 09 '21 at 00:45
4

In linear regression, there are several assumptions, and they concern the error terms rather than the raw variables: the errors should be independent, normally distributed, and have constant variance. In theory, this means the dependent variable is normally distributed around its conditional mean $E(Y \mid X)$. However, in reality it's hard to see a perfectly bell-shaped distribution for the dependent variable $Y$. Hence, in my opinion, it's more important to ensure that the assumptions on the error terms of the fitted model are not violated.

You can perform diagnostics such as a QQ-plot of the residuals to check that they are (1) approximately normally distributed, and a residuals-versus-fitted plot to check that they have (2) roughly equal variance across the range of the predictors. Also, if you have a large $N$, inference is usually quite robust to moderate violations of these assumptions. You could also plot histograms of the variables to check for strong skewness or other major deviations; if present, consider a data transformation.
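
As a sketch of these diagnostics in R (the data and variable names below are hypothetical, just to make the example self-contained):

```r
# Hypothetical data: percentage of obese children per school and one predictor.
dat <- data.frame(
  pct_obese     = c(12, 25, 28, 13, 20, 22),
  median_income = c(70, 45, 40, 68, 52, 50)
)

fit_lm <- lm(pct_obese ~ median_income, data = dat)

# (1) QQ-plot of the residuals to check approximate normality.
qqnorm(resid(fit_lm)); qqline(resid(fit_lm))

# (2) Residuals vs fitted values to check for roughly constant variance.
plot(fitted(fit_lm), resid(fit_lm),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```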

Lastly, to answer your question regarding the right model to use, I think that using linear regression is justifiable as long as there is a strong linear relationship between your $X$ and $Y$ variables and the model is not used to extrapolate outside the range of the fitted data. But of course, you could use models such as log-binomial or beta regression, which cater to a $(0, 1)$ interval for $Y$; each makes different assumptions about the dependent variable. You should first inspect the distribution of $Y$ before deciding which model is appropriate.
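
For reference, a hedged sketch of the two bounded-response alternatives mentioned above, working on the proportion scale; the `betareg` package is one common implementation of beta regression, and the data and variable names are again hypothetical:

```r
# Hypothetical data on the proportion scale (strictly between 0 and 1).
dat <- data.frame(
  prop_obese    = c(0.12, 0.25, 0.28, 0.13, 0.20, 0.22),
  median_income = c(70, 45, 40, 68, 52, 50)
)

# Option 1: ordinary regression on the logit-transformed proportion.
fit_logit <- lm(qlogis(prop_obese) ~ median_income, data = dat)

# Option 2: beta regression, which models Y directly on the (0, 1) interval.
# install.packages("betareg")  # if not already installed
library(betareg)
fit_beta <- betareg(prop_obese ~ median_income, data = dat)
summary(fit_beta)
```

Note that both approaches require $Y$ to lie strictly inside $(0, 1)$; exact zeros or ones need special handling.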

Hope this helps.

Hiromi
  • +1 for this in particular: "However, in reality it's hard to see a perfectly bell-shaped distribution for the dependent variable $Y$. Hence, in my opinion, it's more important to ensure that the assumptions on the error terms of the fitted model are not violated." Well-said. – Mark White Jun 12 '17 at 03:01