How can standard logistic regression model a fractional response variable when the denominator is available?

Question

I have X and Y variables, as well as a cluster variable (State). X and State are derived from Database A, while Y and State are derived from Database B.

X is a sentiment score ranging between -1 and 1, while Y is a yes or no (0 or 1) response.

In Database A, I aggregate X into average-X by state, while in Database B, I aggregate Y into percentage-Y by state. Then I combine the two datasets as follows:

In the combined data structure, my new outcome is percentage-Y, while I do have the numerator and denominator that give rise to percentage-Y.

I have read the following advice: "The most natural way fractional responses arise is from averaged 0/1 outcomes. In such cases, if you know the denominator, you want to estimate such models using standard probit or logistic regression".

Since I do have the denominator, it seems I can avoid fractional outcome regression and stick with standard logistic regression.

However, how exactly do I fit a logistic regression using the denominator information?

See http://stats.stackexchange.com/questions/164120/interesting-logistic-regression-idea-problem-data-not-currently-in-0-1-form/164127#164127 — Jul 22 '16 at 06:39

score 1 · Answer 1 · answered Jul 22 '16 at 02:00

First note that if you know the percentage and the denominator, then you also know the numerator. So, for example, if you know for a specific class (in your example, the class is state) that the proportion of positive cases in the class is $0.6$, and the denominator of that proportion is $10$, then you immediately know that there are

6 positive (y = 1) cases in that class.
4 negative (y = 0) cases in that class.

With this information you can, in principle, create a new dataset expanding your grouped data. In this example you would end up with

6 rows for the class with $y = 1$.
4 rows for the class with $y = 0$.

Now you can use this new dataset to fit a logistic regression.
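As a rough illustration of the expansion step, here is a minimal sketch in plain Python. The field names (`state`, `avg_x`, `pct_yes`, `n_total`) and the numbers are made up for the example, and it assumes the predictor is constant within a class, so every expanded row simply carries the class-level value.

```python
# Hypothetical grouped data: one record per class (state), holding the
# class-level predictor, the fraction of y = 1 cases, and the denominator.
grouped = [
    {"state": "State 1", "avg_x": 0.20, "pct_yes": 0.6, "n_total": 10},
    {"state": "State 2", "avg_x": -0.10, "pct_yes": 0.375, "n_total": 8},
]


def expand(grouped_rows):
    """Turn each grouped record into n_total individual 0/1 rows."""
    rows = []
    for g in grouped_rows:
        # Recover the numerator from the fraction and the denominator.
        n_yes = round(g["pct_yes"] * g["n_total"])
        n_no = g["n_total"] - n_yes
        # Assumption: every row in the class gets the class-level predictor.
        rows += [{"state": g["state"], "x": g["avg_x"], "y": 1} for _ in range(n_yes)]
        rows += [{"state": g["state"], "x": g["avg_x"], "y": 0} for _ in range(n_no)]
    return rows


expanded = expand(grouped)
```

The resulting `expanded` list has one row per original case and can be handed to any ordinary logistic regression routine.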

In practice, you simply observe that each row in this imaginary expanded data set contributes one term to the log-likelihood (the negative of the usual log loss)

$$ L = \sum_i y_i \log(p_i) + (1 - y_i) \log(1 - p_i) $$

and each of the expanded rows in a class where $y = 1$ contributes the same amount, with the same holding for the rows where $y = 0$. So, instead of physically creating the expanded data set, we can just apply integer weights to the terms of this sum

$$ L = \sum_i w_i y_i \log(p_i) + w_i' (1 - y_i) \log(1 - p_i) $$

where the sum now runs over one $y = 1$ row and one $y = 0$ row per class, and $w_i$ and $w_i'$ are the numbers of positive and negative cases in that class.
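To check that the weighted form and the expanded form really give the same fit, here is a small self-contained sketch. It is not a production fitter: it assumes a single predictor plus an intercept, uses a hand-written Newton-Raphson loop, and runs on made-up grouped counts. The helper names `fit_logistic` and `log_likelihood` are introduced only for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta, rows):
    """Weighted log-likelihood; rows are (x, y, weight), beta = (intercept, slope)."""
    b0, b1 = beta
    total = 0.0
    for x, y, w in rows:
        p = sigmoid(b0 + b1 * x)
        total += w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total


def fit_logistic(rows, n_iter=25):
    """Weighted Newton-Raphson for a one-predictor logistic regression."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in rows:
            p = sigmoid(b0 + b1 * x)
            r = w * (y - p)        # weighted gradient contribution
            v = w * p * (1 - p)    # weighted curvature contribution
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical grouped data: (class-level x, number of y = 1, number of y = 0).
grouped = [(-0.4, 2, 8), (-0.1, 4, 6), (0.2, 6, 4), (0.5, 7, 3)]

# Weighted form: two rows per class, weighted by the positive and negative counts.
weighted = [(x, 1, pos) for x, pos, neg in grouped] + \
           [(x, 0, neg) for x, pos, neg in grouped]

# Expanded form: one unit-weight row per individual case.
expanded = [(x, 1, 1) for x, pos, neg in grouped for _ in range(pos)] + \
           [(x, 0, 1) for x, pos, neg in grouped for _ in range(neg)]

beta_w = fit_logistic(weighted)
beta_e = fit_logistic(expanded)
```

Both fits maximize the same function, so `beta_w` and `beta_e` agree up to floating-point rounding. Many statistics packages offer the same thing through frequency weights or a binomial response given as successes and failures, so in practice the hand-written loop is unnecessary.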

In the imaginary expanded dataset, we expand the rows by the different values of y, but what should fill in the X values? Just the average? — KubiK888, Jul 22 '16 at 15:37
@KubiK888 Sorry for the late reply, I missed your comment. In the example I gave, I was thinking of state as being a predictor, in which case the predictor is constant across the group and there is no real issue. If the predictor is not constant, then yes, you can use the average, or some other summary statistic. There are also multi-level/hierarchical generalized linear models, which deal better with this kind of situation. — Matthew Drury, Jul 26 '16 at 21:36
Maybe I am not making it clear with my example... But I have thought about multilevel analysis as well. I am just not sure if and how I can incorporate it into this analysis. My understanding is that I need individual-level data that contains the X, Y, and State variables all in one dataset, but I don't have that luxury. The individual data came from two different datasets: dataset A has X and State, and dataset B has Y and State. I then aggregated each by State and combined the two datasets on the State variable in order to assess the relationship between X and Y. — KubiK888, Jul 27 '16 at 15:35
