4

I have some aggregated data.

Indep.   Dep.    N
1.3      78%     23
1.2      67%     20

The independent variable is the average score of people in a particular region. The dependent variable is the percentage who said 'yes' in that region. N is the number of subjects the dependent variable is based on.

I have read suggestions to use logistic regression to model a dependent variable that is a percentage or proportion, e.g., "When Dependent Variables Are Not Fit for Linear Models, Now What?". I am confused about how to do this in SPSS. Also, doesn't logistic regression require the dependent variable to be binary?

Shayan Shafiq
KubiK888

1 Answer

2

The usual approach for modelling this type of data structure is to use a fractional logistic regression. The canonical paper is this one by Papke and Wooldridge. Further details can be found here and here. Sadly, it does not seem that there is an appropriate model in SPSS, but maybe others can help more here. (The model is, for example, available in R: see here.)
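To make the idea concrete, here is a minimal sketch in Python (not SPSS) of a fractional logit fit: a logit-link model for the proportion, weighted by N, solved with Newton-Raphson on the quasi-binomial score equations. It uses only the two rows shown in the question. The function and variable names (`fit_fractional_logit`, `predict`, `score`, `prop_yes`) are made up for illustration and do not come from any library.

```python
import math

# The two regions shown in the question (a real analysis would use all regions).
score = [1.3, 1.2]        # independent variable: regional average score
prop_yes = [0.78, 0.67]   # dependent variable: proportion answering 'yes'
n = [23, 20]              # respondents behind each proportion


def inv_logit(eta):
    """Map a linear predictor to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_fractional_logit(x, y, w, max_iter=50, tol=1e-10):
    """Fit logit(E[y]) = b0 + b1 * x by Newton-Raphson, weighting each row by w.

    Hypothetical helper for illustration: it solves the quasi-binomial score
    equations sum_i w_i * (y_i - mu_i) * (1, x_i) = 0.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x_i, y_i, w_i in zip(x, y, w):
            mu = inv_logit(b0 + b1 * x_i)
            resid = w_i * (y_i - mu)        # weighted residual (score term)
            info = w_i * mu * (1.0 - mu)    # weighted binomial variance
            g0 += resid
            g1 += resid * x_i
            h00 += info
            h01 += info * x_i
            h11 += info * x_i * x_i
        # Solve the 2x2 system (information matrix) * step = score.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    return b0, b1


b0, b1 = fit_fractional_logit(score, prop_yes, n)


def predict(x_new):
    """Fitted proportion of 'yes' at a given average score."""
    return inv_logit(b0 + b1 * x_new)


for x_i, y_i in zip(score, prop_yes):
    print(f"score={x_i}: observed={y_i:.2f}, fitted={predict(x_i):.4f}")
```

With two rows and two parameters the fit simply reproduces the observed proportions, so this only demonstrates the mechanics. It also answers the worry about binary outcomes: the estimating equations only need the response to lie between 0 and 1. In practice one would fit the same model with a GLM routine (binomial family, logit link, weights equal to N) and report robust standard errors, as Papke and Wooldridge recommend.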

user2728808
  • Papke and Wooldridge's 1996 paper has been deservedly influential among economists in particular, but the main idea was explicit in mainstream statistical literature long before. See Wedderburn, R. W. M. 1974. Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. _Biometrika_ 61: 439–447. http://doi.org/10.2307/2334725 Incidentally, the idea that continuous proportions are best thought of on logit scale also long predates the use for binary responses of logits as a link function (as we might now say) by Joseph Berkson and others. – Nick Cox Apr 06 '16 at 23:27
  • I defer to your superior knowledge! I should have mentioned that I am an economist... – user2728808 Apr 07 '16 at 00:05
  • That's why we have this site, to pool different personal resources, yours too. – Nick Cox Apr 07 '16 at 02:33
  • Thanks. Before delving into more advanced methods, I would like to apply the simpler approach as described in Papke and Wooldridge's paper. Does it mean if I have no 0 or 1 in my dep var (which is the case), I can transform the % into log-odds, and simply run the linear regression model? – KubiK888 Apr 07 '16 at 03:23
  • You could do that, but the model implied cannot plausibly match your generating process. For example, as the mean proportion approaches 0 or 1, so also the variance must approach 0; hence errors are heteroscedastic. Any good statistical software makes a better approach easily computable, and better solutions have been known for decades, so a fudged approach can't be defended convincingly. – Nick Cox Apr 07 '16 at 08:06
  • Thanks Nick, what might be convincingly better approaches? I have tried beta regression, grouped logistic regression, fractional response regression, and GLM with logit link function. Are these appropriate approaches? – KubiK888 Apr 11 '16 at 03:34
  • @user2728808 Can you comment on my problem? https://stats.stackexchange.com/questions/296227/modeling-votes-on-discrete-outcomes-with-mlogit – user3022875 Aug 04 '17 at 18:03
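
The heteroscedasticity point raised in the comments can be written out. This is a sketch under the assumption (not stated in the thread) that each region's percentage $\hat p_i$ is a binomial proportion from $N_i$ independent respondents with true probability $\pi_i$:

$$
\operatorname{Var}(\hat p_i) = \frac{\pi_i (1 - \pi_i)}{N_i},
\qquad
\operatorname{Var}\!\left(\log\frac{\hat p_i}{1 - \hat p_i}\right) \approx \frac{1}{N_i\, \pi_i (1 - \pi_i)} .
$$

The first expression shrinks to 0 as $\pi_i$ approaches 0 or 1. The second is the delta-method approximation on the log-odds scale, and it varies with both $\pi_i$ and $N_i$. Under this assumption, an unweighted linear regression on the transformed percentages treats unequally precise observations as equally precise.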