I'm trying to strengthen my foundations in statistics, and I thought it would be interesting to express the problem below as a set of equations, rather than relying on the more qualitative treatment given in the textbook Statistics by Freedman.
In the 4th edition of Statistics, Simpson's paradox is brought up in the context of the UC Berkeley gender bias study (Chapter 2, Section 4).
What I've been trying to do is write the problem down as equations. Is the formulation below correct?
My random variables are:
$M$ (Major) has sample space $S_M = \{A, B, C, D, E, F\}$
$G$ (gender) has sample space $S_G = \{Male, Female\}$
$D$ (decision) has sample space $S_D = \{Accepted, Rejected\}$
We also have the following functions: (Do these functions even make sense?)
$c(m, g, d) = $ count of students given major $m \in S_M$, gender $g \in S_G$, and decision $d \in S_D$
$r(m, g) = \frac{c(m, g, d=Accepted)}{c(m, g, d=Accepted) + c(m, g, d=Rejected)}$ is the fraction of applicants accepted among all who applied, for major $m \in S_M$ and gender $g \in S_G$
We also have the following:
- $P(M,G,D)$ is the joint probability distribution of the above r.v.
- $P(M = m | G = g) = \sum_{d \in S_D} P(M = m, D = d | G = g)$ is the probability that a student with gender $g$ applied to major $m$.
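- $P(M = m) = \sum_{d \in S_D, g \in S_G} P(M = m, G = g, D = d)$ is the marginal probability that a student applied to major $m$, regardless of gender or decision.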
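One way to check whether $c$ and $r$ make sense is to tie them to the joint distribution. This is only a sketch, assuming $P$ is taken to be the empirical distribution of the applicants; $N$ is a symbol introduced here for the total number of applicants:

$$P(M = m, G = g, D = d) = \frac{c(m, g, d)}{N}, \qquad N = \sum_{m \in S_M} \sum_{g \in S_G} \sum_{d \in S_D} c(m, g, d)$$

$$r(m, g) = \frac{P(M = m, G = g, D = Accepted)}{\sum_{d \in S_D} P(M = m, G = g, D = d)} = P(D = Accepted \mid M = m, G = g)$$

Under that reading, $r(m, g)$ is simply the conditional acceptance probability given major and gender, since the factor $1/N$ cancels in the ratio.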
To get the overall admission rates that the data shows, $44\%$ for male applicants and $35\%$ for female applicants, we can calculate the following:
$E_{P(M|G=Male)}[r(M, g=Male)] = 44\%$ for male students
$E_{P(M|G=Female)}[r(M, g=Female)] = 35\%$ for female students
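These two expectations can be unpacked with the law of total probability. A sketch, assuming $r(m, g)$ is read as the empirical conditional probability $P(D = Accepted \mid M = m, G = g)$:

$$E_{P(M|G=g)}[r(M, g)] = \sum_{m \in S_M} P(M = m \mid G = g) \, P(D = Accepted \mid M = m, G = g) = P(D = Accepted \mid G = g)$$

So this quantity is the overall acceptance rate for gender $g$, with each major weighted by how often applicants of that gender chose it.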
However, to calculate the weighted average that adjusts for choice of major, we'd have to take the expectation over the marginal distribution $P(M)$, so that both genders are weighted by the same mix of majors:
- $E_{P(M)}[r(M, g=Male)] = 39\%$
- $E_{P(M)}[r(M, g=Female)] = 43\%$
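The two kinds of expectation can be checked numerically. The sketch below is a minimal one, not a definitive implementation. It assumes the widely circulated counts for the six largest majors (as distributed, for example, in R's `UCBAdmissions` dataset) rather than values transcribed from the textbook, and the helper names `applied`, `p_major_given_gender`, `p_major`, `rate_own_mix` and `rate_common_mix` are hypothetical names introduced here.

```python
# Counts c(m, g, d) stored as (accepted, rejected) per (major, gender).
# ASSUMPTION: these are the commonly distributed Berkeley 1973 counts for
# the six largest majors, not figures copied from the textbook itself.
MAJORS = ["A", "B", "C", "D", "E", "F"]
GENDERS = ["Male", "Female"]

counts = {
    ("A", "Male"): (512, 313), ("A", "Female"): (89, 19),
    ("B", "Male"): (353, 207), ("B", "Female"): (17, 8),
    ("C", "Male"): (120, 205), ("C", "Female"): (202, 391),
    ("D", "Male"): (138, 279), ("D", "Female"): (131, 244),
    ("E", "Male"): (53, 138),  ("E", "Female"): (94, 299),
    ("F", "Male"): (22, 351),  ("F", "Female"): (24, 317),
}


def c(m, g, d):
    """Number of applicants with major m, gender g and decision d."""
    accepted, rejected = counts[(m, g)]
    return accepted if d == "Accepted" else rejected


def applied(m, g):
    """All applicants to major m with gender g."""
    return c(m, g, "Accepted") + c(m, g, "Rejected")


def r(m, g):
    """Acceptance rate for major m and gender g."""
    return c(m, g, "Accepted") / applied(m, g)


def p_major_given_gender(m, g):
    """Empirical P(M = m | G = g)."""
    return applied(m, g) / sum(applied(k, g) for k in MAJORS)


def p_major(m):
    """Empirical marginal P(M = m)."""
    total = sum(applied(k, g) for k in MAJORS for g in GENDERS)
    return sum(applied(m, g) for g in GENDERS) / total


def rate_own_mix(g):
    """E over P(M | G = g) of r(M, g): each gender's own mix of majors."""
    return sum(p_major_given_gender(m, g) * r(m, g) for m in MAJORS)


def rate_common_mix(g):
    """E over P(M) of r(M, g): the same mix of majors for both genders."""
    return sum(p_major(m) * r(m, g) for m in MAJORS)


for g in GENDERS:
    print(g, round(rate_own_mix(g), 3), round(rate_common_mix(g), 3))
```

With these assumed counts, the common-mix averages come out close to the $39\%$ and $43\%$ figures above. The own-mix rate is exactly accepted divided by applied for each gender within the six majors, so it need not match a percentage computed over a larger pool of applicants.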
Am I correct in my statements above? Any help or guidance is much appreciated.