What happens when switching from a more aggregated to a less aggregated unit in ecological regression?

Question

Sample state data

State     Percent_ethnicity=1   Percent_voting=1
A         20%                   60%
B         56%                   65%

Sample city data

city   State     Percent_ethnicity=1   Percent_voting=1
1      A         77%                   70%
2      A         46%                   25%
1      B         56%                   67%

I have aggregated race and voting choice data at the state and city levels. Say the ecological regression model at the state level looks like this, where j indexes states, Y is the percentage voting for option 1 (only two options: 0, 1), X is the percentage whose race is 1 (only two options: 0, 1), and E is the error term:

Yj = a + b*Xj + Ej

I would like to estimate a and b via a linear regression model.

After that, say I switch to running the ecological regression at the city level, where p indexes cities, Y is the percentage voting for option 1 (only two options: 0, 1), X is the percentage whose race is 1 (only two options: 0, 1), and N is the error term:

Yp = a + b*Xp + Np

Would I be able to infer anything, without seeing the results, about how the estimates of a and b and the error terms E and N compare or change between the two ecological models?

You do not really want to model a binary outcome using linear regression. — Björn, Mar 02 '16 at 06:48
Individual data is binary, but they are aggregated at the state and city levels as percentages — KubiK888, Mar 02 '16 at 07:36
@Björn has a good point. If you know the numbers who voted (not just the percentages) then a [logistic model](http://stats.stackexchange.com/a/88052/28500) would be appropriate. Otherwise, you could consider [beta regression](http://stats.stackexchange.com/a/29042/28500). Note that your confidence in a percentage value will be higher in cases where more votes were cast. Standard linear regression is risky here. Also, do you have data on all cities (and towns and unincorporated areas etc) within each State so that a nested analysis (city within State) is possible? — EdM, Mar 04 '16 at 14:46
I am not sure what doing logistic regression would mean here. I have different data sets providing each variable, and they are linkable only through the area variables. I think I can only do logistic regression if I know both the voting decision and the ethnicity of the same individual, but I don't from the existing data. — KubiK888, Mar 04 '16 at 16:09
You do know that a percentage cannot be >100% or <0%, so linear regression (which has no mechanism for avoiding such predictions) might well not be the right thing to do. If you always have 1000s of voters and the percentages you are interested in are always around 50%, then of course a linear model might be an okay approximation. A vanilla flavor standard logistic regression for individual voter data will indeed not be possible, but that is not necessarily a reason to not consider the underlying distribution (and to e.g. model logit(p) accounting for the distribution of covariates). — Björn, Mar 04 '16 at 16:38
OK, I see what you mean now: you want to use logistic regression to produce a continuous probability between 0 and 1, not discrete binary classes. Let's say I do use logistic regression. I am interested in how the parameters and errors might change. Will N > E because we have more cities and a smaller n in each city? — KubiK888, Mar 04 '16 at 17:21
And can we infer anything about how a and b are expected to change? Or is it impossible to know? — KubiK888, Mar 04 '16 at 17:22
I am also unable to find any papers or references I could read to learn about this specific topic. — KubiK888, Mar 04 '16 at 17:24
Is there any information in your State data that is from sources other than your city data? Specifically, do your State data include results from towns or unincorporated localities that are not included in the city results? Without knowing that it's hard to provide an answer. Also, could you be a bit more specific about the hypothesis that you are trying to test here? Why do you expect that values of `a` and `b` would differ in the two types of models? — EdM, Mar 06 '16 at 19:53
No, the lowest level of data is the city level, but each city has a corresponding state. I would like to run one regression analysis at the city level and another at the aggregated state level. I suspect the estimates change because of possible within-cluster correlation (i.e., cities within the same state are more alike than cities from other states, due to geographic proximity). But I would like to know how to prove it mathematically. — KubiK888, Mar 06 '16 at 22:24

score 1 · Accepted Answer · answered Mar 07 '16 at 14:13

The coefficients and error terms would almost certainly differ between the state and city level, but for reasons that you probably did not intend.

Coefficients. The way you have posed the problem, you would be taking averages of percentage voting 1 by city without taking population into account. An extreme example shows the danger. Say that there is no racial/ethnic* difference in voting, and that a State has 1000 cities. In the single large city, all 1,000,000 people vote for 1; all 1000 voters in each of the 999 other cities vote for 0. Averaged over people in the State, 1 wins with a bit over 50% of the vote, but 0 gets 99.9% averaged over the cities (without taking population into account).
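
To make the arithmetic in this example concrete, here is a minimal Python sketch. The city list is just the hypothetical state described above (one city of 1,000,000 voters and 999 cities of 1,000 voters each); the variable names are my own.

```python
# Hypothetical state from the example: (number of voters, share voting 1).
cities = [(1_000_000, 1.0)] + [(1_000, 0.0)] * 999

# Unweighted mean of city-level shares: every city counts equally.
unweighted_share_1 = sum(share for _, share in cities) / len(cities)

# Population-weighted mean: every voter counts equally.
total_voters = sum(n for n, _ in cities)
weighted_share_1 = sum(n * share for n, share in cities) / total_voters

print(f"share voting 1, averaged over cities: {unweighted_share_1:.4f}")
print(f"share voting 1, averaged over voters: {weighted_share_1:.4f}")
```

The unweighted figure treats each city as a single observation, which is what an ordinary regression on city percentages does unless you weight by population.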

Error terms. In a linear regression, error terms include everything that the linear model missed: non-linear effects, interactions, important variables missing from the model. So say that the influence of race/ethnicity on voting is different in a rural southern state than in a northern urban state. If the model ignores that possibility, then you would expect different errors for within-state versus among-state comparisons. Much of statistics is trying to figure out what lurks within those error terms and to take it into account.
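
As an illustration of how a missing interaction ends up in the error term, here is a small sketch with made-up numbers: two hypothetical states whose cities follow different lines, fit once with a single pooled line and once with a separate line per state. The slopes, intercepts, and the `ols` and `rss` helpers are assumptions for illustration, not estimates from any real data.

```python
def ols(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def rss(xs, ys, a, b):
    """Residual sum of squares of the line a + b*x on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical city-level data: same X values in both states, different lines.
x = [0.1, 0.3, 0.5, 0.7, 0.9]
y_south = [0.2 + 0.6 * xi for xi in x]  # assumed strong effect
y_north = [0.5 + 0.1 * xi for xi in x]  # assumed weak effect

# One pooled line for all cities, ignoring which state they belong to.
a_pool, b_pool = ols(x + x, y_south + y_north)
rss_pooled = rss(x + x, y_south + y_north, a_pool, b_pool)

# Separate line per state (equivalent to a state-by-X interaction).
rss_by_state = sum(rss(x, y, *ols(x, y)) for y in (y_south, y_north))

print(f"pooled fit: a={a_pool:.3f}, b={b_pool:.3f}, RSS={rss_pooled:.5f}")
print(f"per-state fits: RSS={rss_by_state:.5f}")
```

In this toy setup the pooled line fits neither state, and all of the between-state difference in slope shows up as residual error.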

Two messages here. First, although I expect that you intended to take population size into account, the way you posed the model in your question did not. Whether you model your outcome variable as a percentage or as a number of votes, your analysis must include population size. A generalized linear model rather than a standard linear regression is called for.
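
One way to bring population size in is a binomial GLM with a logit link, fit to vote counts rather than percentages. Below is a rough, dependency-free sketch using Newton-Raphson; in practice you would use a statistics package. The percentages are the three city rows from the question's sample table, but the voter counts are invented (the question gives none), and `fit_binomial_logit` is a hypothetical helper name.

```python
import math

def fit_binomial_logit(x, k, n, iterations=25):
    """Fit logit(p_i) = a + b*x_i with k_i ~ Binomial(n_i, p_i) by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Score (gradient of the log-likelihood) for a and b.
        ua = sum(ki - ni * pi for ki, ni, pi in zip(k, n, p))
        ub = sum(xi * (ki - ni * pi) for xi, ki, ni, pi in zip(x, k, n, p))
        # Fisher information; cities with more voters get more weight.
        w = [ni * pi * (1.0 - pi) for ni, pi in zip(n, p)]
        s0 = sum(w)
        s1 = sum(wi * xi for wi, xi in zip(w, x))
        s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = s0 * s2 - s1 * s1
        # Newton step: solve the 2x2 system information * delta = score.
        a += (s2 * ua - s1 * ub) / det
        b += (s0 * ub - s1 * ua) / det
    return a, b

# Shares from the question's sample city table; voter counts are ASSUMED.
x_share = [0.77, 0.46, 0.56]       # share with ethnicity = 1
voters = [12_000, 800, 45_000]     # hypothetical number of voters per city
votes_1 = [8_400, 200, 30_150]     # 70%, 25%, 67% of the assumed voter counts

a_hat, b_hat = fit_binomial_logit(x_share, votes_1, voters)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f} (log-odds scale)")
```

Because the likelihood is built from counts, a city with 45,000 voters influences the fit far more than one with 800, which is the weighting that the plain regression on percentages leaves out.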

Second, you need to be a good deal more specific in formulating your model. There are ways to take additional variables, interactions, the fact that each city has its own associated state, and so forth, into account. Even if you choose to develop a simpler model, forcing yourself to consider these additional possibilities will mean that you are at least doing so with your eyes open to them. A solid understanding of different statistical designs, as you might get from an advanced statistics course, would help. See this page as an example of what can be included in a voting/demographic model.

*In US bureaucratic terms, race and ethnicity are distinct concepts. If you conflate them as you do in this question, others reviewing your work will take you much less seriously.

Thanks. I have also come up with a "proof"/rationale that they are likely to change, but thanks very much for giving your reasoning so I can view it from a different angle. I also agree with your point about using a generalized rather than a standard linear model in order to capture the within-cluster correlations. — KubiK888, Mar 07 '16 at 17:53
