Though Python is used to generate the examples, this is not a Python question; links to literature/theory are welcome.
I'm wondering how one would go about determining whether there is a significant difference between the column/row values of a table of proportions.
Given raw data such as:
# output from: dt.sample(10, random_state=1)
# this raw data is provided and can be used as part of a solution
     A  B          W
7    0  0   6.868475
318  2  3   0.675412
452  2  2   3.640888
368  1  3   1.179303
242  0  2   9.078588
429  2  3  10.531222
262  2  2  29.270480
310  2  3   1.181533
318  1  3   3.965531
49   1  0  19.296948
The following weighted crosstab is made:
A     0     1     2
B
0  35.3  27.2  43.2
1  18.0  22.9  19.5
2  26.4  23.1  15.6
3  20.3  26.8  21.7
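For reference, a table of this shape (weighted percentages that sum to 100 within each column of A) could also be built with `pd.crosstab`. This is only a minimal sketch on toy data: the function name `weighted_column_percentages` and the `toy` frame are hypothetical, and the column names `A`, `B`, `W` are assumed to match the raw data above.

```python
import pandas as pd


def weighted_column_percentages(df, row="B", col="A", weight="W"):
    """Return (percentages, column_bases) for a weighted crosstab.

    Each column of `percentages` sums to 100 (up to rounding);
    `column_bases` holds the total weight in each column.
    """
    # Sum of weights in every (row, col) cell; empty cells become 0.
    weighted_counts = pd.crosstab(
        index=df[row], columns=df[col], values=df[weight], aggfunc="sum"
    ).fillna(0)
    column_bases = weighted_counts.sum(axis=0)
    percentages = weighted_counts.div(column_bases, axis=1).mul(100).round(1)
    return percentages, column_bases


# Hypothetical toy data, not the data behind the table above.
toy = pd.DataFrame({
    "A": [0, 0, 1, 1],
    "B": [0, 1, 0, 1],
    "W": [1.0, 3.0, 2.0, 2.0],
})
percentages, column_bases = weighted_column_percentages(toy)
print(percentages)
```

Returning the column bases alongside the percentages is a deliberate choice here, since any test on the proportions would presumably need them.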
Cell row 1, col 1 contains the value 22.9 (a percentage). How would I determine whether this percentage is significantly different from the values in columns 0 and 2 (18.0 and 19.5)?
I'm assuming that it's some sort of t-test, but I can't seem to find something that covers this particular case.
I would also be interested in how to compare values between columns. It seems that the question is about comparing proportions both within groups and between groups?
Edit
I would like to be able to determine which columns are significantly different, not just whether there is a significant difference. So, for row 1, col 1 the result might be that col 0 is significantly different but col 2 is not.
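To make the desired comparison concrete, one candidate approach could be sketched as pairwise two-proportion z-tests within a row, with a Bonferroni adjustment for the number of pairs. This is an assumption for illustration, not a confirmed answer: it treats the columns as independent samples and ignores the effect of the weights on the variance (an effective sample size may be more appropriate as the base). The function names and the column bases below are hypothetical; only the row percentages come from the crosstab above.

```python
import math
from itertools import combinations


def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided pooled z-test for two independent proportions.

    p1, p2 are proportions in [0, 1]; n1, n2 are the column bases.
    Returns (z, p_value).
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both proportions are 0 or both are 1: no evidence of a difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


def compare_columns_in_row(row_pcts, bases, alpha=0.05):
    """Test every pair of columns within one row (Bonferroni-adjusted)."""
    pairs = list(combinations(sorted(row_pcts), 2))
    adjusted_alpha = alpha / len(pairs)
    results = {}
    for a, b in pairs:
        z, p = two_proportion_z_test(
            row_pcts[a] / 100, bases[a], row_pcts[b] / 100, bases[b]
        )
        results[(a, b)] = {"z": z, "p": p, "significant": p < adjusted_alpha}
    return results


# Row B = 1 of the crosstab; the bases are hypothetical placeholders.
row_1 = {0: 18.0, 1: 22.9, 2: 19.5}
hypothetical_bases = {0: 200, 1: 200, 2: 600}
results = compare_columns_in_row(row_1, hypothetical_bases)
for pair, res in results.items():
    print(pair, round(res["z"], 2), round(res["p"], 3), res["significant"])
```

Each key of `results` is a pair of column labels, so the columns that differ from a given cell can be read off directly.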
Edit 2
If there's anything that is unclear about this question please let me know.
The expected output would be something along the lines of:
A 0 1 2
B
0 35.3 27.2 43.2
2 2 0,1
1 18.0 22.9 19.5
0
2 26.4 23.1 15.6
0,1
3 20.3 26.8 21.7
1 0,2 1
I've just made the above up; it is meant to indicate that, for each element in a row, there would be a test between that element and all of the others.
It shows that the cell row 1, col 2 is significantly different from and row 2, col 1
Data
Not strictly necessary to the question - just putting the (sloppy) code that generated the above table in case it's of use to anyone in future.
```python
import numpy as np
import pandas as pd

np.random.seed(3)
N = 500

dt_1 = pd.DataFrame({
    'A' : np.random.choice(range(3), size = N, p = [0.3, 0.3, 0.4]),
    'B' : np.random.choice(range(4), size = N, p = [0.25, .25, .25, .25]),
    'W' : np.abs(np.random.normal(loc = 1, scale = 10, size = N))
})
dt_2 = pd.DataFrame({
    'A' : np.random.choice(range(3), size = N, p = [0.1, 0.1, 0.8]),
    'B' : np.random.choice(range(4), size = N, p = [0.5, .2, .1, .2]),
    'W' : np.abs(np.random.normal(loc = 1, scale = 10, size = N))
})

dt = pd.concat([dt_1, dt_2], axis = 0)
dt['W'] = dt['W'].div(dt['W'].sum()).mul(len(dt))

crosstab = dt.groupby("A").apply(lambda g:
    g.groupby("B").apply(lambda sg:
        round(100 * (sg['W'].sum() / g['W'].sum()), 1)
    )
).reset_index(drop=True)
crosstab = crosstab.T
crosstab.columns.name = "A"
```