
I'm trying to strengthen my foundations in statistics, and I thought it would be interesting to express the problem below as a set of equations, rather than relying on the more qualitative treatment given in the textbook *Statistics* by Freedman.

In the 4th edition of *Statistics*, Simpson's paradox is brought up in the context of the UC Berkeley gender bias study (Chapter 2, Section 4).

What I've been trying to do is formulate the problem as equations. Is the formulation below correct?

My random variables are:

  • $M$ (Major) has sample space $S_M = \{A, B, C, D, E, F\}$

  • $G$ (gender) has sample space $S_G = \{\text{Male}, \text{Female}\}$

  • $D$ (decision) has sample space $S_D = \{\text{Accepted}, \text{Rejected}\}$

We also have the following functions: (Do these functions even make sense?)

  • $c(m, g, d)$ is the count of students with major $m \in S_M$, gender $g \in S_G$, and decision $d \in S_D$

  • $r(m, g) = \frac{c(m, g, \text{Accepted})}{c(m, g, \text{Accepted}) + c(m, g, \text{Rejected})}$ is the fraction of applicants with major $m \in S_M$ and gender $g \in S_G$ who were accepted
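
To make the two functions concrete, here is a minimal Python sketch. The records are made-up toy data (not the Berkeley numbers), and storing one `(major, gender, decision)` tuple per applicant is just one possible way to hold them.

```python
from collections import Counter

# Hypothetical applicant records, one tuple per applicant:
# (major, gender, decision). Toy data for illustration only.
records = [
    ("A", "Male", "Accepted"), ("A", "Male", "Accepted"), ("A", "Male", "Rejected"),
    ("A", "Female", "Accepted"),
    ("B", "Male", "Rejected"),
    ("B", "Female", "Accepted"), ("B", "Female", "Rejected"), ("B", "Female", "Rejected"),
]

# Tally how many applicants share each (major, gender, decision) combination.
counts = Counter(records)


def c(m, g, d):
    """Count of applicants with major m, gender g and decision d."""
    return counts[(m, g, d)]  # Counter returns 0 for unseen combinations


def r(m, g):
    """Fraction of applicants with major m and gender g who were accepted."""
    accepted = c(m, g, "Accepted")
    total = accepted + c(m, g, "Rejected")
    if total == 0:
        raise ValueError(f"no applicants with major {m!r} and gender {g!r}")
    return accepted / total


print(r("A", "Male"), r("B", "Female"))
```

Note that $r(m, g)$ is undefined when nobody of gender $g$ applied to major $m$, which is why the sketch raises an error in that case.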

We also have the following:

  • $P(M, G, D)$ is the joint probability distribution of the random variables above.
  • $P(M = m \mid G = g) = \sum_{d \in S_D} P(M = m, D = d \mid G = g)$ is the probability that a student with gender $g$ applied to major $m$.
  • $P(M = m) = \sum_{g \in S_G} \sum_{d \in S_D} P(M = m, G = g, D = d)$ is the marginal probability that a student applied to major $m$.

To get the $44\%$ figure that the data shows for total male students admitted, and the $35\%$ figure for the female students, we can calculate the following:

  • $E_{P(M|G=Male)}[r(M, g=Male)] = 44\%$ for male students

  • $E_{P(M|G=Female)}[r(M, g=Female)] = 35\%$ for female students
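
A sketch of why these expectations should equal the overall admission rate for each gender, assuming $P$ is the empirical distribution of the applicant table (so probabilities are relative frequencies). The symbols $n(m, g) = c(m, g, \text{Accepted}) + c(m, g, \text{Rejected})$ and $n(g) = \sum_{m \in S_M} n(m, g)$ are notation introduced here, not taken from the book:

$$
E_{P(M \mid G = g)}[r(M, g)]
= \sum_{m \in S_M} \frac{n(m, g)}{n(g)} \cdot \frac{c(m, g, \text{Accepted})}{n(m, g)}
= \frac{\sum_{m \in S_M} c(m, g, \text{Accepted})}{n(g)}
= P(D = \text{Accepted} \mid G = g)
$$

The $n(m, g)$ factors cancel, so weighting each major's rate by that gender's own application pattern simply recovers the pooled acceptance rate for gender $g$.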


However, to calculate a weighted average that is not distorted by the different majors men and women applied to, we'd have to take the expectation over the marginal distribution $P(M)$:

  • $E_{P(M)}[r(M, g=Male)] = 39\%$
  • $E_{P(M)}[r(M, g=Female)] = 43\%$
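
Here is a hedged Python sketch of both kinds of expectation. The applicant counts and admission rates are my transcription of the commonly reproduced six-major table, so treat them as an assumption and check them against the book. The names `pooled` and `standardized` are my own labels.

```python
# Applicants per major and gender.
# Assumption: transcribed from the six-major table; verify against the book.
applicants = {
    "A": {"Male": 825, "Female": 108},
    "B": {"Male": 560, "Female": 25},
    "C": {"Male": 325, "Female": 593},
    "D": {"Male": 417, "Female": 375},
    "E": {"Male": 191, "Female": 393},
    "F": {"Male": 373, "Female": 341},
}

# r(m, g): fraction admitted, rounded to whole percents as in the table.
rate = {
    "A": {"Male": 0.62, "Female": 0.82},
    "B": {"Male": 0.63, "Female": 0.68},
    "C": {"Male": 0.37, "Female": 0.34},
    "D": {"Male": 0.33, "Female": 0.35},
    "E": {"Male": 0.28, "Female": 0.24},
    "F": {"Male": 0.06, "Female": 0.07},
}

majors = list(applicants)
genders = ["Male", "Female"]


def p_major_given_gender(g):
    """Empirical P(M = m | G = g): share of gender g's applications going to m."""
    total = sum(applicants[m][g] for m in majors)
    return {m: applicants[m][g] / total for m in majors}


def p_major():
    """Empirical marginal P(M = m): share of all applications going to m."""
    total = sum(applicants[m][g] for m in majors for g in genders)
    return {m: sum(applicants[m][g] for g in genders) / total for m in majors}


def expectation(weights, g):
    """Expected value of r(M, g) when M is distributed according to `weights`."""
    return sum(weights[m] * rate[m][g] for m in majors)


# Weights depend on gender: this reproduces each gender's pooled admission rate.
pooled = {g: expectation(p_major_given_gender(g), g) for g in genders}

# Same weights for both genders: the weighted average over P(M).
standardized = {g: expectation(p_major(), g) for g in genders}

for g in genders:
    print(f"{g}: pooled {pooled[g]:.1%}, weighted by P(M) {standardized[g]:.1%}")
```

With these inputs the $P(M)$-weighted averages come out at roughly 39% for men and 43% for women. The pooled rates for these six majors come out near 45% and 30%, not 44% and 35%. As far as I can tell, the 44% and 35% figures are campus-wide totals over all majors, so they would not be reproduced exactly from the six-major table alone.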

Am I correct in my statements above? Any help or guidance is much appreciated.

dd bb33
  • I'm not aware of the UC Berkeley study you refer to, but the same issue was raised in a Dutch study of gender and funding: the bias claimed turned out to be a good illustration of Simpson's paradox, which was explained [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4687563/). – pau13rown Oct 29 '19 at 15:06
  • @pau13rown The study is described in the [Wikipedia article on Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox#UC_Berkeley_gender_bias). – whuber Oct 29 '19 at 16:36
  • @whuber Thanks for linking to it. What I'm trying to find out is whether my equations above are correct. I never had formal training in stats, in case my question sounds trivial. – dd bb33 Oct 29 '19 at 19:01
  • It would help to use technical terms correctly: that makes it easier for readers to follow and easier for you to refer to textbooks and other helpful accounts. For the meaning of "random variable," for instance, please refer to https://stats.stackexchange.com/questions/50. – whuber Oct 29 '19 at 19:12
  • Thanks for the suggestions. I edited the question to use the technical terms correctly. I hope it makes more sense now. – dd bb33 Oct 30 '19 at 03:16
  • It helps to distinguish *attributes* of objects from *random variables.* One way to model this situation constructs a sample space $\Omega$ out of all applicants to all majors. Each applicant has *attributes* of gender, major, and final decision. From these attributes you can construct (as needed) *random variables* which, by definition, assign definite *numbers* to each individual $\omega\in\Omega.$ Many people visualize (and actualize) this situation by means of a rectangular table in which rows represent individuals and columns represent attributes, putting attribute values into the cells. – whuber Oct 30 '19 at 14:40
