Misguiding data relationship between Y and X?

Question

Background:

I am currently surveying a number of articles on the subject of urban sprawl. One thing I have come across multiple times in the literature, and which bothers me, is a somewhat strange relationship between the data used to construct the dependent and the independent variables.

The relationship:

Imagine that:

The dependent variable (y) = the size of the central city population.
A control variable (x) = the size of the metropolitan population.
... and that the size of the metropolitan population is a sum of central city population and the suburban population (i.e. x = y + (x - y)).

Question:

Does the above-stated relationship not induce some kind of bias (given that the obtained covariance is partially based on the size of y, which ultimately makes x endogenous to y)?

Disclaimer:

Sorry if the question is too simple - I could not find a post with a similar question, nor any textbook examples.

score 3 · Accepted Answer · answered Mar 14 '16 at 12:25

Let's start by defining the two separate subsets:

the metropolitan population, $X_m$
the non-metro population, $X_s$
the total population is then $Y = X_m + X_s$

How, then, do you interpret the covariance $Cov(Y,X_m)$? You can replace $Y$ with its components, which gives $Cov(X_m + X_s,X_m)$. Because covariance is additive in each argument and $Cov(X_m,X_m) = Var(X_m)$, this can be rewritten as:

$$Cov(X_m + X_s,X_m) = Cov(X_s,X_m) + Var(X_m)$$

As for misleading examples: the covariance between the metro and non-metro populations could be zero, yet the covariance between the total population and the metro population would still be positive, because $Var(X_m)$ is positive. Conversely, if the covariance between the metro and non-metro populations is negative (e.g. they compete for population), $Cov(Y,X_m)$ could be close to zero, especially if you artificially select a sample that has little variation in $X_m$.
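To make the two examples concrete, here is a minimal simulation sketch. The distributions, means, sample size, and variable names (`x_m`, `x_s`, `x_s_comp`) are assumptions chosen purely for illustration and are not taken from any real population data. It checks the decomposition numerically in an "independent" case and in a "competition" case where the non-metro population falls one-for-one as the metro population rises.

```python
import numpy as np


def cov(a, b):
    """Sample covariance (ddof=1) of two 1-D arrays."""
    return np.cov(a, b)[0, 1]


rng = np.random.default_rng(0)
n = 10_000  # hypothetical number of regions

# Case 1: metro and non-metro populations drawn independently,
# so Cov(X_s, X_m) is roughly zero.
x_m = rng.normal(100.0, 10.0, n)
x_s = rng.normal(50.0, 10.0, n)
y = x_m + x_s

lhs_indep = cov(y, x_m)
rhs_indep = cov(x_s, x_m) + np.var(x_m, ddof=1)

# Case 2: the subsets compete for population. Here X_s drops by one
# unit per unit of X_m (an assumed, extreme form of competition),
# so Cov(X_s, X_m) is roughly -Var(X_m).
x_s_comp = 150.0 - x_m + rng.normal(0.0, 10.0, n)
y_comp = x_m + x_s_comp

lhs_comp = cov(y_comp, x_m)
rhs_comp = cov(x_s_comp, x_m) + np.var(x_m, ddof=1)

print("independent: Cov(Y, X_m) =", lhs_indep, " decomposition =", rhs_indep)
print("competition: Cov(Y, X_m) =", lhs_comp, " decomposition =", rhs_comp)
```

In the first case the covariance with the total should sit near $Var(X_m)$ even though the subsets are unrelated; in the second it should sit near zero even though the subsets are strongly (negatively) related. In both cases the left-hand side and the decomposition agree up to floating-point error, since the identity also holds for sample covariances.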

Andy W

I think I see your point - although our shared definition of metropolitan areas, central cities, and suburbs might differ a bit. However, the thing that I am not able to conclude (based on the answer in its current form), is whether the overlap (i.e. the covariation caused by the DIRECT incorporation of y in x) is statistically sound - in particular, when the variation in the central cities (i.e. the core of the metropolitan area) and the suburbs (i.e. the rest of the metropolitan area) is large. – Michael W Mar 14 '16 at 13:05
I don't think I understand your complaints @MichaelWinther. The covariance of the aggregate is defined by the variance and covariance of the subsets. I gave two examples where the aggregate covariance could be misleading if the subsets behave in particular ways. – Andy W Mar 14 '16 at 13:14
The relationship is not biased (and I'm not sure what *statistically sound* means), but it could be misleading in interpreting causal effects - what most people do when they estimate covariances and regression equations. – Andy W Mar 14 '16 at 13:15
Sorry, failed to understand it as examples of such (but I got your point now). Was not intended as a complaint. – Michael W Mar 14 '16 at 13:20
