I have a dataset where I'm trying to predict a student's percentage score on a school exam from some other collected IVs. I'm wondering how to correctly run this analysis in R.

A few related questions:

First, I've seen a number of questions on this topic (e.g. What are the issues with using percentage outcome in linear regression?) where the usual recommendation is to use logistic regression. I'm a bit confused about how logistic regression fits into this. I've never formally taken a class on logistic regression, and my understanding (from the machine learning world) is that the evaluation of the linear input through a logistic function is interpreted as the probability that this data point belongs to the majority (1) class. However, I don't have classes per se, so I'm confused about how a logistic model actually works in this case as I essentially have a continuous variable that is bounded by 0 and 100.

I've also seen Poisson regression being recommended in some places. However, that appears to predict integer values, and is likely not a good model for this?

Finally, what are the steps for running this in R? Would I convert my percentage DV to a logit, run the regular lm() function and interpret the coefficients (and their p values) like normal? Is the interpretation of the significance the same as linear regression (i.e. the IV is a significant predictor of the percentage DV when holding all other IVs constant)?

EDIT:

In the comments, it has been suggested that the answer can be found here: How to do logistic regression in R when outcome is fractional (a ratio of two counts)?

The solution there, as proposed by Greg, is to use 2 columns, one specifying a proportion and another to specify the weight (number of total points). This won't work in my case.

In the dataset I only have access to the final percentage/proportion. The data comes from different classes, and I have no way of knowing the number of points the individual received on the exam, nor the total number of points available for that exam (as it would likely be different for individuals in different classes).

Option #1 in Greg's post is also not possible because I don't have a binary/categorical response.

Simon
  • Look at the comment of @Glen_b in the link you include. That's very helpful. Is your percentage score the ratio of two counts (e.g., number of right answers divided by number of questions in the exam) or is it actually a continuous variable? – DeltaIV Feb 14 '17 at 09:27
  • the % score is a ratio of counts (# of points / total number of points available). However, the data comes from different exams where the total # of points differ, which is why I'm using a percentage instead of a count of points as the DV. Is logistic GLM still used for this? How is it different from logistic regression? And how does one run this in `R`? – Simon Feb 14 '17 at 16:00
  • More specifically, see http://stats.stackexchange.com/questions/26762 – amoeba Feb 15 '17 at 09:01
  • @DeltaIV You misunderstood Glen_b and what you say in the last comment above is wrong. – amoeba Feb 15 '17 at 09:06
  • I made an edit to my question to address the solution found in that post. I don't think it will work in my case as I only have access to the percentage values, and not the true counts/totals – Simon Feb 15 '17 at 10:03
  • I have to disagree with @amoeba. There is no real problem in applying logistic regression to measured proportions (more generally, doubly bounded outcomes). Not having binary data does not bite. Not knowing original numerator and denominator does not bite. You just need software that doesn't throw you out because your data are not 0 and 1. You will need to scale to proportions, which is trivial. You need to ask for the right kind of standard error. This model has been known since at least 1974. – Nick Cox Feb 15 '17 at 10:51
  • (Also, the logistic as a function for continuous outcomes is a 19th century idea long preceding its adoption by Berkson in 1941 as a link function for binary responses.) – Nick Cox Feb 15 '17 at 10:51
  • Poisson isn't confined to integers, but it doesn't sound a good idea here. The variance structure there is quite different. Here is a friendly reference: http://www.stata-journal.com/sjpdf.html?articlenum=st0147 The 1974 allusion is https://academic.oup.com/biomet/article/61/3/439/249095/Quasi-likelihood-functions-generalized-linear – Nick Cox Feb 15 '17 at 10:55
  • Thanks @NickCox, yes indeed this is an option too. I am actually aware of it, but wrote my previous comment too much in a hurry to mention it. In R this can be done with the "quasibinomial" distribution in the `glm` function. In Stata I think this is called "fractional logit" (?). This is mentioned e.g. in http://stats.stackexchange.com/questions/43366 and http://stats.stackexchange.com/questions/62679 (boy do we need a good comprehensive answer about all that!) – amoeba Feb 15 '17 at 10:58
  • @amoeba Fine, but your comment that you cannot logistic regression without access to original counts remains incorrect or at least misleading. Stata does have a command `fracreg` but it's just a wrapper for long standard ideas and "fractional regression" is just another name. – Nick Cox Feb 15 '17 at 11:02
  • ... cannot use ... – Nick Cox Feb 15 '17 at 11:09
  • @NickCox Yes I agree that that statement is misleading (and I upvoted your comment). In my defense I can only say that "logistic regression" is usually understood to be about counts, and one could argue that what you are referring to should even be called differently (that's why in R it's called "quasibinomial" and not simply "binomial"; the latter outputs wrong standard errors). I think I will keep that comment for now, so that the conversation remains understandable. – amoeba Feb 15 '17 at 11:32
  • @amoeba Not wanting to be difficult, but I would still differ on one point. If there is a usual understanding of logistic regression, it is that the method focuses on binary outcomes or responses, coded 0 or 1. Whether they arrive as counted frequencies is incidental and often they will **not** be presented that way. Quasibinomial as a term is fine by me as a way of flagging that the assumptions (ideal conditions!) are different when the response is a measurement or ratio between 0 and 1. – Nick Cox Feb 15 '17 at 11:39
  • @NickCox No difference here, of course you are right - the usual understanding is that the outcome is binary. I meant that binary is a rudimentary count too (either 0 or 1 out of 1), so a count proportion response (such as e.g. 17 out of 100) fits to this scheme exactly, whereas a measurement between 0 and 1 such as 0.17 requires some conceptual leap, or generalization. I think we agree on the subject matter. And we also agree that I should have been more precise in the above comments. – amoeba Feb 15 '17 at 11:45
  • @amoeba Good. We converged. – Nick Cox Feb 15 '17 at 11:55
  • @NickCox By the way, consider adding an answer to http://stats.stackexchange.com/questions/29038 which looks to me as potentially the "master" thread on this topic, but where gung's accepted answer only mentions beta and logistic options (and hints at the logit-transform followed by linear regression option). No answer in that thread discusses quasibinomial / fractional logit / logistic regression with robust standard errors; this would be an important addition. Personally, I voted to close *this* thread as a duplicate of that one (and other similar threads are already closed as duplicates). – amoeba Feb 15 '17 at 11:58
  • @amoeba ok, my comment was wrong, I deleted it. Well, at least I helped drawing attention to the question: judging by the number of comments, it's not that easy to find a comprehensive thread on CV discussing all the cases. However, forgive me, but with all these comments and links I'm a bit confused...[1/2] – DeltaIV Feb 15 '17 at 12:48
  • [2/2]..Is it correct to say that: if successes and failures are available for each data point, then use the approach described [here](http://stats.stackexchange.com/a/26779/58675). If only the ratio of successes and failures is available at each data point, then use `glm`, but set `family=quasibinomial`. Correct? I'm amazed at the variety of different problems which can all be tackled with Generalized Linear Models. – DeltaIV Feb 15 '17 at 12:49
  • @DeltaIV Yes, the above is correct with the caveat that `glm` with `family=quasibinomial` is only one possible option, another being beta regression (`betareg`). – amoeba Feb 15 '17 at 13:10
  • @DeltaIV See also my answer http://stats.stackexchange.com/a/233664/28666 to my own question. The first part of that answer discusses several available options. – amoeba Feb 15 '17 at 13:50
  • A question about beta regression as an option: a comment in the other thread says that it cannot take on values of 0 or 1. Why is that distribution a possibility here? surely any type of ratio data like this could potentially have 0/1 values? – Simon Feb 15 '17 at 15:04
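
The `glm` with `family=quasibinomial` route discussed in the comments above might look roughly like the following. This is only a sketch on simulated data: the variable names (`hours`, `score_pct`, `score`) and the data-generating process are assumptions made up for illustration and are not from the asker's dataset.

```r
# Simulated data; all names and parameter values here are hypothetical.
set.seed(1)
n <- 200
hours <- runif(n, 0, 10)                         # made-up predictor
mu <- plogis(-1 + 0.3 * hours)                   # true mean proportion
score_prop <- rbeta(n, mu * 20, (1 - mu) * 20)   # proportions in (0, 1)
dat <- data.frame(hours = hours, score_pct = 100 * score_prop)

# Rescale the percentage to a proportion, then fit a logit-link GLM.
dat$score <- dat$score_pct / 100
fit <- glm(score ~ hours, family = quasibinomial(link = "logit"), data = dat)

# Coefficients are on the log-odds scale, as in logistic regression.
print(summary(fit)$coefficients)

# Predictions on the response scale are proportions between 0 and 1.
new <- data.frame(hours = c(0, 5, 10))
pred <- predict(fit, newdata = new, type = "response")
print(pred)
```

Only the proportion itself is needed here; no numerator, denominator, or weights column is supplied.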

1 Answer

One of the following three solutions might work for you. However, I am curious what others will suggest:

  1. You can use simple linear regression. However, that procedure might violate some assumptions of linear regression (depends on your actual data). Inferential statistics such as p-values and/or confidence-bands might not be trustworthy. Moreover, your model might predict scores outside of the boundary, which makes interpretations difficult.

  2. You can transform the percentage scores into logits and use them as the outcome in a linear regression. Here is the transformation formula: ln(p/(1-p)), where p is the score expressed as a proportion between 0 and 1. By doing that, you apply the link function of logistic regression to the outcome before fitting the linear regression. That might solve some of the previous problems, especially the last one, because logits range from -infinity to +infinity. However, you lose interpretability. (Edit: A short discussion of this approach and why it is not recommended can be found in the source linked for the third solution)

  3. Beta-Regression might be the model you are looking for. The following vignette shows how to apply beta regression in R using Cribari-Neto's and Zeileis's "betareg"-Package: ftp://cran.r-project.org/pub/R/web/packages/betareg/vignettes/betareg.pdf
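
As a rough illustration of the first two options, the sketch below fits both to simulated data using base R only. The variable names (`hours`, `score`) and the simulated relationship are assumptions for illustration. Beta regression is not shown because it needs the separate `betareg` package.

```r
# Simulated proportions; names and parameter values are hypothetical.
set.seed(2)
n <- 200
hours <- runif(n, 0, 10)
mu <- plogis(-1 + 0.3 * hours)
dat <- data.frame(hours = hours, score = rbeta(n, mu * 20, (1 - mu) * 20))

# Option 1: plain linear regression on the proportion.
fit_lm <- lm(score ~ hours, data = dat)
# Predictions are not constrained to [0, 1], e.g. when extrapolating.
pred_lm <- predict(fit_lm, newdata = data.frame(hours = 30))

# Option 2: logit-transform the outcome, then fit a linear regression.
# qlogis(p) computes log(p / (1 - p)). It returns -Inf at p = 0 and Inf
# at p = 1, so exact 0% or 100% scores would need separate handling.
dat$logit_score <- qlogis(dat$score)
fit_logit <- lm(logit_score ~ hours, data = dat)
# plogis() maps a predicted logit back to the proportion scale.
pred_logit <- plogis(predict(fit_logit, newdata = data.frame(hours = 30)))

print(c(linear = unname(pred_lm), logit = unname(pred_logit)))
```

The back-transformed value always lies between 0 and 1, but it is the inverse logit of a predicted mean logit, which is generally not the same as a predicted mean proportion.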

I hope that some of my suggestions might help you!

Jake Westfall
StatisticsRat
  • Given that OP is talking about ratios of two counts (as clarified in the comments), *none* of these three options is actually appropriate. The most suitable approach is logistic regression. – amoeba Feb 15 '17 at 09:05
  • #1 has its fans as the linear probability model; I am with those who think it's a bad idea usually. The answer misses out what is usually the best solution in my experience, generalized linear models with logit link, binomial family and robust standard errors. See my comments under the question. – Nick Cox Feb 15 '17 at 11:12
  • Update to my comment above: OP has further clarified that the counts themselves are not available, so this excludes "vanilla" logistic regression that I meant above, but one can still use logistic regression with appropriately modified standard errors (aka "quasibinomial" GLM in R), as NickCox wrote. – amoeba Feb 15 '17 at 12:02
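
To see what "appropriately modified standard errors" means in practice, the sketch below compares `family = binomial` with `family = quasibinomial` on the same fractional outcome. The data are simulated and the names (`hours`, `score`) are assumptions for illustration. Note that the quasibinomial standard errors are rescaled by an estimated dispersion; sandwich-type robust standard errors, as mentioned by Nick Cox, are a different calculation that needs an add-on package (for example `sandwich`) and are not shown here.

```r
# Simulated proportions; names and parameter values are hypothetical.
set.seed(3)
n <- 200
hours <- runif(n, 0, 10)
mu <- plogis(-1 + 0.3 * hours)
dat <- data.frame(hours = hours, score = rbeta(n, mu * 20, (1 - mu) * 20))

# family = binomial warns about non-integer successes on fractional data,
# so the warning is suppressed here for the comparison.
fit_bin <- suppressWarnings(glm(score ~ hours, family = binomial, data = dat))
fit_quasi <- glm(score ~ hours, family = quasibinomial, data = dat)

se_bin <- summary(fit_bin)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
dispersion <- summary(fit_quasi)$dispersion

# The point estimates agree; only the standard errors differ,
# by a factor of sqrt(dispersion).
print(all.equal(coef(fit_bin), coef(fit_quasi)))
print(all.equal(unname(se_quasi / se_bin), rep(sqrt(dispersion), 2)))
```

Because only the standard errors change, the p-values from the two fits differ even though the fitted curve is the same.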