
Suppose I have a data set of $n$ students, and each student $i$ has two distinct features:

  1. Which school they go to, $S_i$
  2. Which sport they play, $P_i$

No student can play more than one sport or go to more than one school. There are $N$ schools and $M$ sports that can be played. Given this, we can define a list of students $L$ like

$L=[(S_1, P_1), (S_2, P_2), \ldots, (S_n, P_n)]$.

Now, I want to know the answer to the question:

Which (School, Sport) pairings occur more often than would be expected by chance?

I know that if I just want to know "Are (School, Sport) pairings random?" I could use something like an $N \times M$ Fisher exact test, but I want to identify specific pairings.

The obvious solution is to run $N \times M$ one-vs-all tests, one per pairing, but this seems like it's going to kill any signal. I'm wondering if there's a better (rigorous) approach.

kjetil b halvorsen
Danny W.

1 Answer


Your data have the form of a contingency table, with schools in rows, sports in columns, and the number of students in each cell. Calculate the expected count in each cell under independence, as you do when calculating the chi-square statistic.
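As a minimal sketch of that step, the expected count for a cell under independence is (row total × column total) / $n$. The function name `expected_counts` and the toy data below are hypothetical, chosen only for illustration; the input is the list $L$ of (school, sport) pairs from the question.

```python
from collections import Counter

def expected_counts(students):
    """Return (observed, expected) counts keyed by (school, sport).

    `students` is a list of (school, sport) pairs, one per student.
    Expected counts assume school and sport are independent.
    """
    n = len(students)
    observed = Counter(students)                       # cell counts
    school_totals = Counter(s for s, _ in students)    # row totals
    sport_totals = Counter(p for _, p in students)     # column totals
    expected = {
        (s, p): school_totals[s] * sport_totals[p] / n
        for s in school_totals
        for p in sport_totals
    }
    return observed, expected

# Hypothetical toy data: 2 schools, 2 sports, 80 students.
students = (
    [("A", "soccer")] * 30 + [("A", "tennis")] * 10
    + [("B", "soccer")] * 10 + [("B", "tennis")] * 30
)
observed, expected = expected_counts(students)
```

In this toy example every row and column total is 40, so each expected count is 40 × 40 / 80 = 20, while the observed counts are 30 or 10.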

To identify the school/sport combinations that differ most from the independence case, you can calculate the per-cell contributions to the chi-square statistic. One way of displaying the data visually is a mosaic plot; for examples, see "How to determine significant associations in a mosaic plot".
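One way to sketch the per-cell contributions is shown below. Each contribution is $(O - E)^2 / E$, which is the square of the Pearson residual $(O - E)/\sqrt{E}$; the signed residual is included here as an addition, since it shows whether a pairing is over- or under-represented. The function name `cell_contributions` and the toy table are hypothetical, and the code assumes every school and every sport has at least one student (so no expected count is zero).

```python
import math

def cell_contributions(table):
    """For a table of counts (rows = schools, columns = sports), return a
    same-shaped nested list of (pearson_residual, chi_square_contribution)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n    # expected under independence
            residual = (obs - exp) / math.sqrt(exp)    # signed Pearson residual
            out_row.append((residual, residual ** 2))  # contribution = residual squared
        result.append(out_row)
    return result

# Hypothetical toy table: 2 schools (rows) by 2 sports (columns).
table = [[30, 10],
         [10, 30]]
cells = cell_contributions(table)

# The contributions sum to the usual chi-square statistic.
chi_square = sum(contrib for row in cells for _, contrib in row)
```

Cells with large positive residuals are the pairings that occur more often than independence predicts. Sorting cells by contribution gives a ranking, not a formal per-cell test.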

If the table is really large, correspondence analysis could be helpful.

kjetil b halvorsen