Is there a simple example or a mathematical demonstration of why direct and reverse regression (covariates included) can give contrasting results?

Question

I found that, in examining the gender wage gap, there are cases where direct regression shows that men earn more than women with the same educational level (or qualification, however measured), while reverse regression shows that men are more educated (or qualified) than women earning the same wage. Formally, if $Y$ is wage, $E$ education and $G$ an indicator for being a man, with $Y=\alpha+\beta*E+\gamma*G$ and $E=\alpha^*+\beta^**Y+\gamma^**G$, we would have both $\gamma>0$ (an indicator of discrimination against women, who are paid less than men with the same educational level) and $\gamma^*>0$ (an indicator of discrimination against men, who are more educated than women with the same wage). I read that such a paradox would not occur if we could measure productivity without error (education, or measured qualification, being basically a proxy for it). However, what I'm interested in (any variables could replace gender, wage and education) is a simple fabricated numerical example where this occurs (I was thinking of $2$ educational levels and $2$ income levels, so as to have a $2 \times 2 \times 2$ table). Alternatively, a mathematical demonstration that such a situation can occur would also help.

References: Goldberger, Arthur. "Reverse Regression and Salary Discrimination," J. Human Res., 1984, 19(3), pp. 293-318

@ Gung: I read the paper from Greene. I'd say he doesn't believe in reverse regression. In fact, he shows that (equation 6): $c^*=\frac{(\bar{y}_f-\bar{y})*(1-R^2_{y,x,d})}{1-P}-c$, where $c$ and $c^*$ would be the discrimination coefficients (direct discrimination for $c<0$ and $c^*>0$, reverse discrimination for $c>0$ and $c^*<0$). This means that: $c^*>0 \iff c<\frac{(\bar{y}_f-\bar{y})*(1-R^2_{y,x,d})}{1-P}$. Assuming $c<0$ (direct discrimination found in direct regression), taking $k=-c(>0)$ and $k^*=-c^*$ (direct discrimination for $k^*<0$, reverse discrimination for $k^*>0$), it is: $k^*<0 \iff k> \frac{(\bar{y}-\bar{y}_f)*(1-R^2_{y,x,d})}{1-P}$.

Edited on June 12

Greene concludes that "sign and magnitude" (of $c^*$) "may have nothing to do with discrimination".

Given $\bar{y}=\bar{y}_f*P+\bar{y}_m*(1-P)$, it is: $\bar{y}-\bar{y}_f=\bar{y}_f*(P-1)+\bar{y}_m*(1-P)=(\bar{y}_m-\bar{y}_f)*(1-P)$. Thus, $k^*<0 \iff k> \frac{(\bar{y}-\bar{y}_f)*(1-R^2_{y,x,d})}{1-P}=\frac{(\bar{y}_m-\bar{y}_f)*(1-P)*(1-R^2_{y,x,d})}{1-P}=(\bar{y}_m-\bar{y}_f)*(1-R^2_{y,x,d})$.
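The algebra above can be checked numerically. This is only a sketch with made-up values: the share of women `P`, the group means `y_f` and `y_m`, and the `R2` of the direct regression are hypothetical numbers chosen for illustration, not figures from Greene or Goldberger.

```r
# Hypothetical values: share of women P and group mean wages
P     <- 0.4
y_f   <- 20000
y_m   <- 26000
y_bar <- P * y_f + (1 - P) * y_m   # overall mean wage

# Identity used in the text: y_bar - y_f = (y_m - y_f) * (1 - P)
lhs <- y_bar - y_f
rhs <- (y_m - y_f) * (1 - P)
all.equal(lhs, rhs)

# Threshold that k has to exceed for k* < 0,
# for a hypothetical R^2 of the direct regression
R2 <- 0.3
threshold <- (y_m - y_f) * (1 - R2)
threshold
```

With these numbers the raw wage gap is 6000, and $k$ would have to exceed 70% of it for the two regressions to agree in sign.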

I'd say that $k$ reflects wage discrimination (in case employers consider qualification to be fully expressed by $E$), $1-R^2_{y,x,d}$ the part of the variance of the wage depending on the error term $\epsilon$ (thus, unexplained) and $\bar{y}_m-\bar{y}_f$ the difference in wage, due to both discrimination and the higher average education of men. Thus, $\bar{y}_m-\bar{y}_f>k$, and whether it will remain above $k$ even after multiplication by $(1-R^2_{y,x,d})$ depends on:

1) The difference in qualification between men and women (leading to higher $\bar{y}_m-\bar{y}_f$ independently of discrimination).

2) The variance of the error term, which, among men and women earning the same salary, leads to a lower educational level for women than for men (the average $\epsilon$ in the direct regression is higher for women, given that the former have a lower overall expected wage than the latter).

Then, Greene studies the case where no discrimination was found in the 1st regression, showing that, with women earning less than men, reverse discrimination would be found: this is due to the fact that, with the whole wage difference due to different qualification between men and women, among people earning the same wage, women again have an average value of the error term in the 1st regression higher than men and, thus, a lower average educational level. Compared to the case with discrimination, $k=0$ implies that this will not be "compensated" by wage discrimination in direct regression, so there are no doubts about the sign of $k^*$.
Finally, he analyzes the case where men and women have the same average qualification, finding that discrimination in the two regressions would, in this case, agree in sign. This is because, with the whole wage difference due to discrimination, among people earning the same wage, women have not only (again) an average value of the error term in the 1st regression higher than men, but also a higher average educational level. Compared to the case with different qualification between men and women, this time there is no higher average qualification for men, so men and women will "meet" at a point where women are more qualified, and the same discrimination as in direct regression will be found. However, he says that the coefficient in reverse regression would be hard to interpret, because it would just be the one in direct regression multiplied by the $R^2$ of the regression of $Y$ on $X$, which should be "evidence against, not in support of, discrimination". The point is that, with no error, we would measure the same discrimination as in direct regression. The higher the variance of the error, the stronger the effect of "women having a higher $\epsilon$ than men", which decreases the estimated effect.

@ Martijn Weterings: It seems to me things are more easily understood if we notice that, with men having a higher educational level than women on average, if we compare men and women with the same wage, we are looking at a subset of the observations where women have higher $\epsilon$-values than men (even in the absence of wage discrimination). If we had a problem of common support (the highest wage levels only available for men, the lowest ones only for women), we could conclude that the richest men would have an average $\epsilon$ higher than $0$, and the poorest women one lower than $0$ to compensate for that. But, in a common support situation, it seems to me we would have a version of Simpson's paradox: women would have a higher $\epsilon$ for each wage level, but in the higher-wage groups (associated not only with being a man, directly in case of discrimination, and with being more qualified, but also with a higher $\epsilon$) we would find a higher number of men, and the other way round for the lower-wage groups.

Related: [What is the difference between linear regression on y with x and x with y?](https://stats.stackexchange.com/q/22718/7290) (Note that my answer there cites Goldberger.) — gung - Reinstate Monica, Jun 07 '17 at 19:54
Thank you: indeed your answer links to a paper from Greene, that I downloaded and want to read because it seems to me it does what I asked (giving a mathematical explanation of the paradox). You said that people would draw a line that is somewhere between direct and reverse regression: do you know of any study supporting this, or you just think we would give the same weight to horizontal and vertical distances? — Federico Tedeschi, Jun 08 '17 at 08:38
Maybe this means that the results are nonsensical. You can't eat the cake and have it too: there's simply no discrimination found in data, unless you really want to find it. — Aksakal, Jun 12 '17 at 15:16
I don't get the last @comment. Do you discuss the gender wage differences or the statistical effect? Is the example not clear? It's such long-winded talk now that I do not see whether you changed topic or still struggle with the question. The paradox and the source of the logical error are clear: correlation does not equal causation. If you reverse the model to 'E as function of Y', then there's statistically nothing strange about seeing a seemingly opposite effect. See the example. I may add that the causal interpretation of higher E as function of Y for men is not automatically a disadvantage for men. — Sextus Empiricus, Jun 19 '17 at 10:07
I know that correlation does not equal causation and it's ok for me to be using models that are not "real". I was just interested in the mathematical possibility of the paradox. Given, by construction, $E[\epsilon]=0$, I gave an explanation of why it is possible to have a higher $\epsilon$ for women for any given wage level for men. — Federico Tedeschi, Jun 19 '17 at 12:01

Sextus Empiricus · Accepted Answer · 2017-06-08T20:05:49.627

The example (image + R-code to create it) below may explain the paradox by regression to the mean.

1) Regression coefficient paradox

A requirement is that the group means are different.

The differences in the group means go along with the differences in the effects. 'Women have lower education and income than men' => 'men have higher income' (the advantage effect in the direct relation), but also 'men have higher education' (the disadvantage effect in the reverse relation).

The regression to the mean is the effect of the regression line becoming more 'flat' due to the error. Depending on which type of relation you look at this means that the regression line becomes more vertical or more horizontal.
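A minimal sketch of this flattening, assuming a hypothetical linear model `y = a * x + noise` (the names `a`, `x`, `y` and the chosen standard deviations are illustrative and not taken from the question). The slope of `y` on `x` stays close to `a`, but the slope of `x` on `y` is not `1 / a`: the two fitted slopes always multiply to the squared correlation, so the more noise there is, the further the two lines are apart.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 1000
a <- 2                           # hypothetical true effect of x on y
x <- rnorm(n, mean = 0, sd = 1)
y <- a * x + rnorm(n, mean = 0, sd = 3)   # noisy relation

b_yx <- unname(coef(lm(y ~ x))["x"])   # slope when predicting y from x
b_xy <- unname(coef(lm(x ~ y))["y"])   # slope when predicting x from y

# The two slopes multiply to r^2, which is well below 1 here,
# so the reverse slope is much smaller than the inverse 1 / a.
c(b_yx = b_yx, b_xy = b_xy, inverse_of_a = 1 / a, r_squared = cor(x, y)^2)
```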

Now if the group means are different then this will interact with the 'flatter' regression line. In the below example: the smaller regression coefficient, more horizontal/vertical line, will result in a larger parameter for the gender effect (because the genders do not have the same means and are distributed unevenly over the error of the regression to the mean).
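The interaction between different group means and the flatter line can be sketched as follows. The helper `gender_coefs` and its `bias` argument are hypothetical names introduced here; the means, slope and error standard deviation are borrowed from the larger simulation further down in this answer, and `rnorm` is used as an equivalent way of drawing the normal variables. The sketch compares the gender coefficient of the direct and the reverse regression when men get no income bias at all and when they get a bias of 15000.

```r
set.seed(2)   # arbitrary seed, for reproducibility only

# Hypothetical helper: simulate both groups and return the gender
# coefficient from the direct and from the reverse regression.
gender_coefs <- function(bias, n = 2000) {
  we <- rnorm(n, 5, 2)                                 # women's education
  me <- rnorm(n, 9, 2)                                 # men's education (higher mean)
  wi <- 8000 + 1000 * we + rnorm(n, 0, 10000)          # women's income
  mi <- 8000 + 1000 * me + rnorm(n, 0, 10000) + bias   # men's income, plus bias
  d <- data.frame(e = c(we, me), i = c(wi, mi), g = rep(c(0, 1), each = n))
  c(direct  = unname(coef(lm(i ~ e + g, data = d))["g"]),
    reverse = unname(coef(lm(e ~ i + g, data = d))["g"]))
}

no_bias   <- gender_coefs(bias = 0)
with_bias <- gender_coefs(bias = 15000)
no_bias     # direct coefficient near 0, reverse coefficient clearly positive
with_bias   # both coefficients positive: the paradox
```

Under these assumptions the reverse coefficient stays positive even though the income advantage for men is built into the data, because the difference in mean education is only slightly offset by the flat reverse slope.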

(If you would like some help to improve your intuition, you may imagine the relationship with a very low correlation. In that case the regression lines become very vertical/horizontal. The regression line expresses a sum of effect + randomness: it is a prediction of one variable based on the other, and the more randomness there is, the closer the prediction is to the average, so the slope is smaller than the effect alone. The direction then depends on which relationship you look at, which is not paradoxical, as predicting x based on y is a different situation than predicting y based on x. Imagine the intuitive idea of very low correlation again, in which case the prediction should be closer to the average than to the effect. The regression line coefficient expresses this prediction. It is not the naked effect size without the random effect. See also the R-code example, in which the parameter used to generate the data is not the same as the effect size from the model.)

2) Table paradox

The code below also generates a table like the Table 1 in your reference. For most education categories males have a higher income (males have an advantage), yet at the same time for most income categories males have a higher education (males have a disadvantage).

This has to do with the genders not being evenly distributed among the classes. Say we look at education e_i, which according to the model would be associated with salary s_i. Relatively more women than men have an education lower than e_i, and the fraction of them that still make salary s_i will outweigh the men.

So, on average, in comparison to men, women will have the lower education, for the same salary class, because at a given salary class there are more women that are overpaid than men that are underpaid. And... this is not because women have an advantage. Instead, it is because there is a larger number of low educated women!
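A small fabricated table can show the same selection effect. The counts below are hypothetical and chosen only for illustration (two education levels, two income levels, two genders); they are not taken from the reference. In both education classes men more often have a high income, and in both income classes men more often have a high education.

```r
# Hypothetical counts for a 2 x 2 x 2 table (education x income x gender).
# Cell order within each gender: (E=low,Y=low), (E=high,Y=low),
#                                (E=low,Y=high), (E=high,Y=high)
counts <- array(
  c(40, 20, 20, 20,    # women
    20, 20, 20, 40),   # men
  dim = c(2, 2, 2),
  dimnames = list(E = c("low", "high"), Y = c("low", "high"),
                  G = c("women", "men"))
)

# Share with high income, per education level and gender (rows: E, columns: G)
p_high_income <- prop.table(counts, margin = c(1, 3))[, "high", ]
# Share with high education, per income level and gender (rows: Y, columns: G)
p_high_educ <- prop.table(counts, margin = c(2, 3))["high", , ]
p_high_income
p_high_educ

# Same comparison with linear models on the cell counts (frequency weights)
d <- as.data.frame(as.table(counts))
d$y_high <- as.numeric(d$Y == "high")
d$e_high <- as.numeric(d$E == "high")
direct  <- lm(y_high ~ E + G, data = d, weights = Freq)
reverse <- lm(e_high ~ Y + G, data = d, weights = Freq)
c(direct = unname(coef(direct)["Gmen"]), reverse = unname(coef(reverse)["Gmen"]))
```

In this table women are concentrated in the low education and low income cells and men in the high ones, which is what produces a positive gender coefficient in both directions.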

Note: non-linearity may be another explanation.

library(tidyr)
set.seed(1)  # arbitrary seed, added so that the example is reproducible
# generate some random data with
# 1) men having higher education levels than women
# 2) men having a bias in income (the paradox effect occurs if the bias for men is not too large
#    in comparison to the interaction between regression to the mean and unequal distribution)
# note: qnorm(runif(n,0,1),m,s) draws from a normal distribution with mean m and sd s
we <- qnorm(runif(50,0,1),5,2)
me <- qnorm(runif(50,0,1),7,2)
wi <- 30000+we*10000+qnorm(runif(50,0,1),0,20000)
mi <- 30000+me*10000+qnorm(runif(50,0,1),0,20000)+10000

# modelling
data <- list(e=c(we,me), i=c(wi,mi), g=c(rep(0,50),rep(1,50)))
m1 <- lm(i~1+e+g,data)
m2 <- lm(e~1+i+g,data)

# graphical output
plot(data$e, data$i, pch=21, bg=c("pink","lightblue")[data$g+1],xlab="education",ylab="income")

lines(data$e[1:50],predict(m1)[1:50],col="pink")
lines(data$e[51:100],predict(m1)[51:100],col="lightblue")

lines(predict(m2)[1:50],data$i[1:50],col="pink",lty=2)
lines(predict(m2)[51:100],data$i[51:100],col="lightblue",lty=2)

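# tabular comparison

# mean education per income class (income rounded to units of 10000), by gender
r <- aggregate(e~round(i/10000)*g, data=data, FUN="mean")
names(r) <- c("i","g","e")
spread(as.data.frame(r),g,e)

# mean income per education class (education rounded to whole units), by gender
r <- aggregate(i~round(e)*g, data=data, FUN="mean")
names(r) <- c("e","g","i")
spread(as.data.frame(r),g,i)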

###################################################

edit June 8 evening

Based on your comments I have adjusted some parameters in the model, in order to make the mathematical demonstration of the paradox a bit more dramatic.

The regression line for income ~ education has the same coefficient as the parameter that was used in the model to create the data.

However, regression/correlation does not equal causation. It is in the switch of meaning from correlation to causation, which can be interpreted in two (or 3) different ways, income as function of education or education as function of income (or a combination), that the paradox occurs. You should pick only one option and if you take multiple interpretations then you get the conflicts (which are only seemingly a paradox/conflict since you made at least one wrong interpretation if you picked multiple options).

The regression lines should foremost be interpreted as predictors:

What is the expected income for a given education, and
what is the expected education for a given income?

(The fact that, at the same salary level, women have a lower expected education than men should not automatically be turned into the interpretation that women have an advantage, which would mean that there is a systematic advantage, that is, a causal effect that gives women an advantage. This example demonstrates that clearly, since men are explicitly given the advantage when the data are generated.)

And, as the new image shows in an intuitive way, $E(y \mid x) \neq E(x \mid y)$: with this much randomness you will have approximately the average value of the dependent variable no matter what the value of the independent variable is, $E(y \mid x) \sim \bar y$ and $E(x \mid y) \sim \bar x$.

The image may also provide an intuition why, given the same salary, men have a higher education (which may be interpreted as a disadvantage if you see this correlation as a causal effect). The assumption that the correlation 'men have higher education as a function of salary' relates to a disadvantage, is a version of the error that a correlation is seen as a causal effect (and so this paradox is a version of that error too).

n=2000

we <- qnorm(runif(n,0,1),5,2)
me <- qnorm(runif(n,0,1),9,2)
wi <- 8000+we*1000+qnorm(runif(n,0,1),0,10000)
mi <- 8000+me*1000+qnorm(runif(n,0,1),0,10000)+15000

# modelling
data <- list(e=c(we,me), i=c(wi,mi), g=c(rep(0,n),rep(1,n)))
m1 <- lm(i~1+e+g,data)
m2 <- lm(e~1+i+g,data)

# graphical output
plot(data$e, data$i, pch=21, bg=c("pink","lightblue") [data$g+1],xlab="education",ylab="income")

lines(data$e[1:n],predict(m1)[1:n],col="red",lwd=2)
lines(data$e[(n+1):(2*n)],predict(m1)[(n+1):(n*2)],col="blue",lwd=2)

lines(predict(m2)[1:n],data$i[1:n],col="red",lty=2,lwd=2)
lines(predict(m2)[(n+1):(2*n)],data$i[(n+1):(2*n)],col="blue",lty=2,lwd=2)

Thanks a lot: I'll study your R code to have a better understanding of the whole process. I agree with everything you said, apart from the last sentence, that I do not fully understand. You said:"So, on average, in comparison to men, women will have the lower education, for the same salary class, because at a given salary class there are more women that are overpaid than men that are underpaid. And... this is not because women have an advantage. Instead, it is because there is a larger number of low educated women!". Btw: now women are *more* educated than men, so things should be reconsidered — Federico Tedeschi, Jun 08 '17 at 08:56
I've run your code and found out that regression lines behave as expected (highlighting the paradox) and for most categories of income men have a higher education. However, with respect to education, there is a split situation: low-educated women earn more than low-educated men, while high-educated women earn less than high-educated men. I think the problem may be due to randomness, given there are too few men in the low-education categories, and too few women in the high-education one. I saw that using 100 observations per gender the problem disappeared. — Federico Tedeschi, Jun 08 '17 at 10:24
It's 2 effects combined. I) regression to the mean: since in this example the men have higher education and income, they will regress towards higher education and income, which is seen as contradictory (indeed nowadays the situation may be different). II) deterministic model: note that the men have a systematically higher income as function of the education. This is why you don't see the same in both directions (same size of disadvantage and advantage in the two different pictures). The systematic advantage for men reduces the 'pseudo'-disadvantage for men in the e vs i relationship. — Sextus Empiricus, Jun 08 '17 at 12:14
_"And... this is not because women have an advantage. Instead, it is because there is a larger number of low educated women!"_ For each salary class you have people that are overpaid and underpaid. If somebody is overpaid then it is most likely a woman, not (necessarily) because women have a systematic advantage, but because women with less education than typical for the job are more numerous. And vice versa, if somebody is underpaid then it is most likely a man because there are many more over educated men than over educated women. So this paradox in the tables relates to selection effect. — Sextus Empiricus, Jun 08 '17 at 12:33
Yes: I'd say the presence of error in the equation of wage makes an OLS approach estimate a weaker relationship between wage and education (increase of variance on the explanatory variable) and, given that men are more educated, leads to an increase in the gender parameter in the reverse regression (in models where we assume wage is based on productivity, which is estimated, this also happens in the direct regression). I noticed that the two lines would basically coincide if we set average education for men to 6.5, so 7 and 5 are ok to have the 2 regression lines graphically show the paradox. — Federico Tedeschi, Jun 08 '17 at 13:31
I wonder whether this idea of women overpaid and men underpaid has something to do with ceiling/floor effects (some jobs just give you a pay between Y_min and Y_max): maybe this could be a case where non-linearity is at work. I think the riddle is: given expected (not in the sense of average) wage (due to education), men have a higher real wage; given real wage, men have a higher expected wage. So, if we say men with an expected wage of 8 earn (on average) 10, and that men with a real wage of 10 have an expected wage of 12 (in terms of educational level), things seem confusing... — Federico Tedeschi, Jun 08 '17 at 13:50
Non-linear effects may indeed be present as well. However, this extends the original question. The 'direct vs reverse' in relation to the data in your reference is mostly this RTTM and misinterpretation of the regression coefficient as the parameter in a deterministic model. In fact, you could calculate from the model $y_i = a x_i + e_i$, given the effect size $a$ and distributions of $x_i$, $e_i$ how your regression coefficient will turn out to be. The regression coefficient is a term in an 'experimental' relation, how to **predict** $y_i$ from $x_i$, and not in the 'deterministic' relation. — Sextus Empiricus, Jun 08 '17 at 15:14
Yes, of course from a direct regression without covariates the calculation of the reverse regression coefficient is rather straightforward. If $X$ and $Y$ have the same variance, the coefficients in the two regressions are even equal — Federico Tedeschi, Jun 09 '17 at 11:43
I mean that, if $Y$ and $X$ have the same variance, considering the two regressions: $Y=\alpha+\beta*X+\epsilon$ and $X=\alpha^*+\beta^**Y+\epsilon^*$, we get: $\beta=\beta^*=Corr(X,Y)$. — Federico Tedeschi, Jun 11 '17 at 13:59
