
Though Python is used to generate the examples, this is not a Python question; links to literature/theory are welcome.

I'm wondering how one would go about determining whether there was a significant difference between the column/row values of a table of proportions.

Given raw data such as:

# output from: dt.sample(10, random_state=1)
# this raw data is provided and can be used as part of a solution

     A  B          W
7    0  0   6.868475
318  2  3   0.675412
452  2  2   3.640888
368  1  3   1.179303
242  0  2   9.078588
429  2  3  10.531222
262  2  2  29.270480
310  2  3   1.181533
318  1  3   3.965531
49   1  0  19.296948

The following weighted crosstab is made:

A     0     1     2
B                  
0  35.3  27.2  43.2
1  18.0  22.9  19.5
2  26.4  23.1  15.6
3  20.3  26.8  21.7

The cell at row 1, col 1 contains the value 22.9 (a percentage). How would I determine whether this percentage is significantly different from the values in columns 0 and 2 (18.0 and 19.5)?

I'm assuming that it's some sort of t-test, but I can't seem to find anything that covers this particular case.

I would also be interested in how to compare values between columns. It seems the question amounts to comparing proportions both within groups and between groups?

Edit

I would like to be able to determine which columns are significantly different, not just whether there is a significant difference. So, for row 1, col 1, the result might be that col 0 is significantly different but col 2 is not.

Edit 2

If there's anything that is unclear about this question please let me know.

The expected output would be something along the lines of:

A     0     1     2
B                  
0  35.3  27.2  43.2
    2     2     0,1

1  18.0  22.9  19.5
           0

2  26.4  23.1  15.6
                0,1
                
3  20.3  26.8  21.7
    1    0,2      1

I've just made the above up, but it is meant to indicate that, for each element in a row, there would be a test between that element and each of the others in the same row.

For example, it shows that in row 1, col 1 is significantly different from col 0, and that in row 2, col 2 is significantly different from cols 0 and 1.

Data

Not strictly necessary for the question; just including the (sloppy) code that generated the above table in case it's of use to anyone in the future.

import numpy as np
import pandas as pd

np.random.seed(3)

N = 500
dt_1 = pd.DataFrame({
    'A': np.random.choice(range(3), size=N, p=[0.3, 0.3, 0.4]),
    'B': np.random.choice(range(4), size=N, p=[0.25, 0.25, 0.25, 0.25]),
    'W': np.abs(np.random.normal(loc=1, scale=10, size=N)),
})

dt_2 = pd.DataFrame({
    'A': np.random.choice(range(3), size=N, p=[0.1, 0.1, 0.8]),
    'B': np.random.choice(range(4), size=N, p=[0.5, 0.2, 0.1, 0.2]),
    'W': np.abs(np.random.normal(loc=1, scale=10, size=N)),
})

dt = pd.concat([dt_1, dt_2], axis = 0)

dt['W'] = dt['W'].div(dt['W'].sum()).mul(len(dt))

# weighted crosstab: for each level of A, the percentage of weight in each level of B
crosstab = (dt.groupby("A")
              .apply(lambda g: 100 * g.groupby("B")["W"].sum() / g["W"].sum())
              .round(1)
              .T)
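An alternative construction (a sketch, assuming a reasonably recent pandas) is `pd.crosstab` with the weights passed as `values`; this also makes it easy to check that each `A` column sums to 100%. A small hypothetical dataset stands in for the real `dt` here:

```python
import numpy as np
import pandas as pd

# hypothetical small example standing in for the real `dt` above
rng = np.random.default_rng(3)
dt = pd.DataFrame({
    "A": rng.integers(0, 3, size=1000),
    "B": rng.integers(0, 4, size=1000),
    "W": np.abs(rng.normal(loc=1, scale=10, size=1000)),
})

# weighted percentages of B within each level of A
crosstab = (pd.crosstab(dt["B"], dt["A"], values=dt["W"],
                        aggfunc="sum", normalize="columns")
            .mul(100))

print(crosstab.round(1))
# each column (level of A) sums to 100%
assert np.allclose(crosstab.sum(axis=0), 100)
```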
baxx
    Sounds like a two-proportions z-test with all possible comparisons. In R you would first wrangle your dataframe into a tidy (long) format with condition A and B, I would add the proportional max into the grid for ease if memory isn't an issue then do row wise two variances tests (http://www.sthda.com/english/wiki/f-test-compare-two-variances-in-r), which you could extract and add the data for into new columns for the test stat and significance. I'm afraid I cant give an answer in python, as I have only just started with it. – Comte Aug 19 '20 at 11:51
    @Comte thanks - would you be using the weights for this or just counts? – baxx Aug 19 '20 at 12:29
  • I would attempt to use the weights/proportions. If I get time I'll write an answer with an example later. Hope it helps. – Comte Aug 19 '20 at 13:50
  • @Comte programming isn't integral for answering this btw, it's a stats question - the python code is just what i used to construct the example – baxx Aug 20 '20 at 16:15
  • @Comte any thoughts / links for this? – baxx Oct 18 '20 at 01:04
  • @Comte just wondering if you ever had chance to draft something up (R would be fine) – baxx Aug 31 '21 at 23:49

1 Answer


A $t$-test will not work in this case because each column sums to 100%. The typical way to test equality of two discrete distributions is with a chi-square test: $$ X^2 = \sum_i^I\frac{(\text{observed }\#_i-\text{expected }\#_i)^2}{\text{expected }\#_i}. $$ Since you have proportions instead of counts, you need to multiply by the number of observations $N$: $$ X^2 = N\sum_i^I\frac{(\text{observed }p_i-\text{expected }p_i)^2}{\text{expected }p_i}. $$

In these cases, the test statistic $X^2$ has a $\chi^2$ distribution with $I-1$ degrees of freedom (since the proportions have to sum to 1).

In your case, your test statistic to compare column 0 and column 1 (taking column 0 as the expected distribution) would be: $$ \begin{align} X_{01}^2 &= N\left[\frac{(0.353-0.272)^2}{0.353} + \frac{(0.180-0.229)^2}{0.180} + \frac{(0.264-0.231)^2}{0.264} + \frac{(0.203-0.268)^2}{0.203}\right] \\ &= N\cdot 0.0568631. \end{align} $$

The chi-square quantile for a 5% test would be qchisq(p=0.95, df=3)=7.81.

If your $N=100$, then $X_{01}^2 = 5.69 < 7.81$, so we would not reject the null hypothesis that column 0 and column 1 have the same distribution.
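This arithmetic can be checked in Python (scipy assumed available; the proportions are read off the crosstab, with column 0 taken as the "expected" distribution):

```python
import numpy as np
from scipy import stats

p0 = np.array([0.353, 0.180, 0.264, 0.203])  # column 0 (expected)
p1 = np.array([0.272, 0.229, 0.231, 0.268])  # column 1 (observed)
N = 100

X2 = N * np.sum((p1 - p0) ** 2 / p0)
crit = stats.chi2.ppf(0.95, df=3)  # same as R's qchisq(p=0.95, df=3)

print(round(X2, 2), round(crit, 2))  # ~5.69 vs ~7.81
print(X2 > crit)                     # False: fail to reject equality
```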

Unfortunately, you seem to want to test all of the columns against one another. In that case, you should adjust the level at which you test (a Bonferroni correction): to conclude significance at a 5% level across the three pairwise comparisons, you would need to compare each test statistic to a 5/3% critical value: qchisq(1-0.05/3, df=3)=10.24.

Your other test statistics: $$ \begin{align} X_{02}^2 &= N\left[\frac{(0.353-0.432)^2}{0.353} + \frac{(0.180-0.195)^2}{0.180} + \frac{(0.264-0.156)^2}{0.264} + \frac{(0.203-0.217)^2}{0.203}\right] \\ &= N\cdot 0.0640772, \qquad \text{and} \\ X_{12}^2 &= N\left[\frac{(0.272-0.432)^2}{0.272} + \frac{(0.229-0.195)^2}{0.229} + \frac{(0.231-0.156)^2}{0.231} + \frac{(0.268-0.217)^2}{0.268}\right] \\ &= N\cdot 0.1332214. \end{align} $$

For $N=100$, $X_{02}^2 = 6.41$ stays below the adjusted critical value of 10.24, but $X_{12}^2 = 13.32$ exceeds it, so only columns 1 and 2 would be deemed significantly different at the (adjusted) 5% level.
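All three pairwise comparisons with the Bonferroni-adjusted critical value can be sketched as follows (scipy assumed available; the lower-numbered column in each pair is treated as "expected", matching the formulas above):

```python
import numpy as np
from itertools import combinations
from scipy import stats

N = 100
cols = {
    0: np.array([0.353, 0.180, 0.264, 0.203]),
    1: np.array([0.272, 0.229, 0.231, 0.268]),
    2: np.array([0.432, 0.195, 0.156, 0.217]),
}

# Bonferroni: 3 pairwise tests, so compare to the 5/3% critical value
crit = stats.chi2.ppf(1 - 0.05 / 3, df=3)  # ~10.24

for i, j in combinations(sorted(cols), 2):
    X2 = N * np.sum((cols[j] - cols[i]) ** 2 / cols[i])
    print(f"cols {i} vs {j}: X2 = {X2:.2f}, significant: {X2 > crit}")
```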

I am a little wary of testing the rows, since those do not add to 100%, so it is not clear what testing rows would mean or whether it is sensible.

kurtosis
  • Thanks - but for "In your case, your test statistic to compare column 0 and column 1 would be", note that it's a comparison between col 0 and col 1 *for a particular row*. So the comparison would be between col 0 and col 1 for some row `i`, and this would be carried out for each row (and each column pairing). I'm also not sure if you've taken the weights into account (if not, why not?). It's possible, if easier, to carry out the test using the raw data (rather than multiplying by `N`); it's just that the final result is a crosstabulation. – baxx Aug 20 '20 at 19:15
  • Oh... You want to compare two cells? Without estimates of each cell's mean and variance, that is not possible. Given your data, I'm not sure you can estimate all of the variances. – kurtosis Aug 20 '20 at 19:26
  • Compare two cells, yes (within a row). The mean and variance of a cell can be computed from the raw data can't it? Why not? The raw data is available -- the final result is the crosstab – baxx Aug 20 '20 at 20:40
  • "Given raw data", I'm saying that the raw data is available, so why can't mean and variance be computed from this? – baxx Aug 20 '20 at 20:41
    Ah, in that case you can do so. That wasn't clear and you mentioned the python was not crucial so I did not read it. Apologies. – kurtosis Aug 20 '20 at 21:03
  • Sorry about that - I've tried to make the post as reproducible as possible by giving code to reproduce etc. The raw data is what is delivered though, the crosstab is produced, the significance test is required. There should be no need to read the python at the bottom of the post, I will make the python at the top more obvious now though – baxx Aug 20 '20 at 21:23