How to compare two samples of frequencies with categorical x values where one is subset of the other

Question

I am trying to compare two arrays of frequencies. The second is a subset of the first. I want to know whether the second is representative of the first. The arrays are:

x    All  Subset
a    136   38
5    127   27
b    103   23
1    102   17
6     71   11
2     27    4

I want a test that returns a p-value close to 1 when B (Subset) behaves like A (All), i.e. when B can be viewed as a scaled copy of A. The null hypothesis is that, for each point on the x axis, the percentage frequencies of B are the same as those of A.

I tried to use a chi-squared test but, since I could have missing values, I don't know whether the test's validity could be compromised. Data are missing not because of technical failures but because a subset is likely to have only a few features (x values) with non-zero frequencies.

Moreover, since the size of the second column (the subset) is not fixed, I don't know how to scale the first column in order to obtain a valid p-value (at the moment its order of magnitude is 10e-80).

Thanks

Chi-square tests generally work well whenever _expected_ frequencies all exceed about 1; the ancient advice about a threshold of 5 still lingers on, despite being rebutted many times. As for your "missing" cell, if there are no relevant items the observed frequency must be entered as 0. Your problem must be converted to a problem about frequencies, not percents. I can't see that Kolmogorov-Smirnov applies to arbitrary categorical variables, as the cumulative distribution is then not uniquely defined. — Nick Cox, Dec 09 '13 at 12:55
KS test assumes continuous distributions. If you apply it to discrete distributions while ignoring that fact it's pretty conservative; you need to simulate the null distribution. "*I decided not to use it because it seems that it need at least a value of 5 in each cell of the array.*" -- did you even read the [first answer](http://stats.stackexchange.com/a/35696/805) of the question you pointed to? Or the second-last comment under the question? — Glen_b, Dec 09 '13 at 14:10
Sorry @Glen_b I misread that answer. I'll change the question accordingly if necessary.. — gc5, Dec 09 '13 at 14:23
Your question is updated, but this point made earlier remains valid if you are considering chi-square: Your problem must be converted to a problem about frequencies, not percents. Also, neither A nor B gives the expected frequencies, or even percents; expected frequencies come from a weighted average of A and B. — Nick Cox, Dec 09 '13 at 15:27
@NickCox Ok, it is wrong.. Moreover I didn't write it in the question, but A is already the weighted average of B and several other columns. I have a matrix with N columns and A is composed by the percentages of the values over all the columns (also over B): in this way I was realizing expected frequencies. Now I'll change percents to frequencies and I come back.. — gc5, Dec 09 '13 at 16:09
In your second sentence, do you mean to end the sentence with 'of the first'? — Glen_b, Dec 09 '13 at 20:34

Glen_b · Answer 1 · 2013-12-10T03:21:36.243

> I am trying to compare two arrays of frequencies. The second is a subset of the first.

This makes them dependent. Normally the right thing to do is compare two distinct sets:

x     Subset   Not-in-Subset
a       38        98
5       27       100
b       23        80
1       17        85
6       11        60
2        4        23

If they behave alike, then the subset behaves like the whole. This is fairly simple logic. Call the "not in subset" values "C". If B has the same distribution as C (the null in the test) and B obviously has the same distribution as itself(!), then B has the same distribution as B+C (i.e. A) -- if you require it, I could show it mathematically, but it's rather trivial.

Consider the subset. If the underlying proportions in each category of the subset (the things the sample proportions estimate) were not the same as those of the remainder, the subset could not be the same as the population as a whole.
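A minimal sketch of this comparison in Python, assuming NumPy and SciPy are available (the variable names are illustrative and not from the question). It rebuilds the Not-in-Subset column as All minus Subset and runs a chi-square test on the resulting 6x2 table with `scipy.stats.chi2_contingency`:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the question, in row order a, 5, b, 1, 6, 2.
all_counts = np.array([136, 127, 103, 102, 71, 27])
subset = np.array([38, 27, 23, 17, 11, 4])

# The two columns must be disjoint, so remove the subset from the whole.
not_in_subset = all_counts - subset

# 6x2 table: one row per category, one column per group.
table = np.column_stack([subset, not_in_subset])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
```

A large p-value here would mean the data give no evidence that the subset's category proportions differ from those of the remainder.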

> I tried to use Chi-squared test but, since I could have missing values, I don't know if the test validity can be compromised.

Can you say more about what is missing and how it arises?

Missingness may be a problem (or may not be a problem) for almost any procedure, depending on its nature.

However, note that categories with ordinary zero counts aren't 'missing values' in the required sense. Our data here are counts; if they're just 0's, they aren't missing: you have an observed count of 0.

> Moreover since the size of the second column (the subset) is not fixed, I don't know how to scale the first column in order to obtain a valid p-value (now magnitude order is 10e-80).

A chi-square can deal with this in the usual fashion.
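To make "the usual fashion" concrete, here is a hedged sketch of how the expected counts come from the table margins, so that neither column needs manual rescaling. It assumes the Subset / Not-in-Subset table above and uses NumPy plus `scipy.stats.chi2` for the tail probability; the names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Rows: categories a, 5, b, 1, 6, 2. Columns: Subset, Not-in-Subset.
observed = np.array([
    [38, 98],
    [27, 100],
    [23, 80],
    [17, 85],
    [11, 60],
    [4, 23],
], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # one total per category
col_totals = observed.sum(axis=0, keepdims=True)  # one total per group
grand_total = observed.sum()

# Expected count = row total * column total / grand total.
expected = row_totals * col_totals / grand_total

# Pearson chi-square statistic and its upper-tail probability.
statistic = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(statistic, dof)

print(f"smallest expected count = {expected.min():.2f}")
print(f"chi2 = {statistic:.3f}, dof = {dof}, p = {p_value:.3f}")
```

The smallest expected count is the quantity to check against the rule of thumb mentioned in the comments on the question (expected frequencies above about 1).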

1) Ok, just out of curiosity, can you point me to something that explains "If they behave alike, then the subset behaves like the whole"? 2) I updated the question regarding missing values; I don't know if more information is needed. 3) Ok, so I don't scale anything. – gc5, Dec 09 '13 at 21:24
1) It's simple logic. Call the "not in subset" values "C". If B has the same distribution as C (the null in the test) and B obviously has the same distribution as itself(!), then B has the same distribution as B+C (i.e. A). 2) Zero frequencies aren't missing in the usual sense; this should cause no problems, though if the expected values were small you might have to modify the test slightly (this presents little difficulty). 3) Correct: the chi-square test itself computes appropriate 'scaling' from the table margin. – Glen_b, Dec 09 '13 at 21:35
Ok, one more question. With scipy (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) I am able to test whether the subset is dependent on the not-in-subset. Is this the same thing as testing whether the subset is dependent on the whole, and whether the subset behaves like the whole? – gc5, Dec 10 '13 at 13:27

score 3 · Accepted Answer · edited Dec 09 '13 at 22:21

Fisher's exact test is useful for cases where you WOULD have used a chi-square test but don't know if you will always meet the cell count conditions (automated testing of survey data for example). It should be noted that Fisher's exact test can be a bit more timid about proclaiming significance (it's more conservative than chi-square).

It's part of the stats package in R... see ?fisher.test

edited Dec 09 '13 at 22:21

Nick Cox

answered Dec 09 '13 at 22:19

Brandon Bertelsen

I want to use it to assess whether, of two states, the later one is "better" than the earlier one (in the sense that its frequencies appear more similar) with respect to an array of expected frequencies, so I think this could be a good idea. Thanks – gc5 Dec 10 '13 at 12:33
Suppose I have more row categories (as in the example). I decided to run N tests (where the two row categories are `category` and `~category`). How do I compute a single p-value for the N tests? I don't know if that's clear. – gc5 Dec 10 '13 at 12:48
I don't understand your second comment. – Brandon Bertelsen Dec 10 '13 at 15:38
Ok, Fisher's exact test is usually implemented for 2x2 contingency tables (maybe because it is computationally too expensive otherwise). I want to run it on a 6x2 contingency table. How can I do that? I thought about calculating a separate Fisher's exact test for each of the rows (for each category, the rows are `category` and `~category`). In your opinion, is that correct? And if so, how do I compute a single p-value that summarizes the results of the tests on all categories (rows)? – gc5 Dec 10 '13 at 15:49
Fisher's exact test can be run on any 2xn matrix (where n ≥ 2). It is a test of independence. So yes, it can be used for each of the rows (columns?). I'm not sure of a way to calculate a single p-value to summarize all categories; I feel like that might be redundant, given that they are subsets. – Brandon Bertelsen Dec 10 '13 at 16:04
Yes, sorry. I was referring to my original example where categories are in rows instead of columns. My contingency table is 2x6, with categories on columns. Using part of Glen_b's answer I was thinking to use subset and not-in-subset as observations arrays so pvalues should not be redundant .. – gc5 Dec 10 '13 at 17:40
An alternative is to do an exact version of the chi-square test (that is, use a chi-square test statistic, but use its exact permutation distribution conditional on the margins). – Glen_b Dec 10 '13 at 20:13
In the end I used Fisher's exact test, building a 2x2 contingency table for each group, with `subset` and `not-in-subset` as rows and `category` and `not-category` as columns. I don't know if there are flaws in this experimental design. I avoided building a single p-value. I am accepting this answer but upvoting all the answers that led me to this choice. @Glen_b I am still a math/stats newbie, so I need to read something about the exact version of the chi-square test. – gc5 Dec 11 '13 at 14:46
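The per-category design described in this comment could be sketched as follows. This is an illustration only, assuming SciPy's `scipy.stats.fisher_exact` on 2x2 tables and the counts from the question; the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import fisher_exact

categories = ["a", "5", "b", "1", "6", "2"]
subset = np.array([38, 27, 23, 17, 11, 4])
not_in_subset = np.array([98, 100, 80, 85, 60, 23])

p_values = {}
for i, name in enumerate(categories):
    # Rows: subset, not-in-subset. Columns: category, not-category.
    table = [
        [int(subset[i]), int(subset.sum() - subset[i])],
        [int(not_in_subset[i]), int(not_in_subset.sum() - not_in_subset[i])],
    ]
    _, p = fisher_exact(table)  # two-sided by default
    p_values[name] = float(p)

for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
```

Note that these six p-values are not adjusted for multiple comparisons, and the six tests are not independent of one another because they are built from the same counts.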
The place to begin would be reading about permutation (and randomization) tests. Once you understand permutation tests, the rest is simple application of the principles to a chi-square statistic. See the discussion on the following questions, for example: [Q1](http://stats.stackexchange.com/questions/63863/what-is-the-benefit-of-using-permutation-tests) ... (ctd) – Glen_b Dec 11 '13 at 15:09
ctd... [Q2](http://stats.stackexchange.com/questions/43958/permutation-test-in-r) $\quad $ [Q3](http://stats.stackexchange.com/questions/55742/difference-between-randomization-test-and-permutation-test) $\quad $ [Q4](http://stats.stackexchange.com/questions/64212/randomisation-permutation-test-for-paired-vectors-in-r) $\quad $ [Q5](http://stats.stackexchange.com/questions/59638/how-to-choose-the-test-statistic-for-permutation-test) $\quad $ [Q6](http://stats.stackexchange.com/questions/69380/permutation-test-based-in-the-wilcoxon-test) – Glen_b Dec 11 '13 at 15:11
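As a rough illustration of the idea (not Glen_b's own code), a Monte Carlo approximation to the permutation distribution of the chi-square statistic might look like this. It assumes NumPy only; `chi2_stat`, `permutation_chi2`, `n_perm`, and `seed` are hypothetical names, and all row and column totals are assumed to be nonzero. Shuffling the group labels while keeping the category labels fixed preserves both margins.

```python
import numpy as np

def chi2_stat(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

def permutation_chi2(table, n_perm=2000, seed=0):
    """Monte Carlo permutation p-value, conditional on both margins."""
    table = np.asarray(table, dtype=int)
    rng = np.random.default_rng(seed)

    # Expand the table into one (category, group) pair per observation.
    row_idx, col_idx = np.indices(table.shape)
    categories = np.repeat(row_idx.ravel(), table.ravel())
    groups = np.repeat(col_idx.ravel(), table.ravel())

    observed = chi2_stat(table)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(groups)
        permuted = np.zeros(table.shape, dtype=int)
        np.add.at(permuted, (categories, shuffled), 1)
        # Small tolerance so ties with the observed value are counted.
        if chi2_stat(permuted) >= observed - 1e-9:
            hits += 1

    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Rows: categories a, 5, b, 1, 6, 2. Columns: Subset, Not-in-Subset.
table = [[38, 98], [27, 100], [23, 80], [17, 85], [11, 60], [4, 23]]
p_value = permutation_chi2(table)
print(f"permutation p = {p_value:.3f}")
```

Increasing `n_perm` reduces the Monte Carlo error of the estimated p-value at the cost of run time.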
Thanks.. I will surely try to understand this point. Thanks for your effort. – gc5 Dec 11 '13 at 23:59

score 1 · Answer 3 · answered Dec 09 '13 at 14:22

Take Spearman's rank correlation for the two columns. Perform the test of significance using either a permutation test or the Fisher transform, as defined on the wiki page. This will establish any monotone dependency between the two sets of numbers, and since it is a non-parametric method it requires no distributional assumptions about the data.

answered Dec 09 '13 at 14:22

htrahdis

I think it cannot be used because it computes the correlation between ranks, so the correlation between `[1,2,3]` and `[2,3,4]`, and the correlation between `[1,2,3]` and `[2,3,5]` is the same (and I don't want it to). Or did I misunderstand? – gc5 Dec 09 '13 at 15:50
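A quick check of this objection, assuming SciPy's `scipy.stats.spearmanr` (the variable names are illustrative). Both pairs of arrays have identical ranks, and so do the All and Subset columns from the question, whose counts are in the same order:

```python
from scipy.stats import spearmanr

# The two pairs from the comment: same ranks, different values.
rho_1, _ = spearmanr([1, 2, 3], [2, 3, 4])
rho_2, _ = spearmanr([1, 2, 3], [2, 3, 5])

# The All and Subset columns from the question.
all_counts = [136, 127, 103, 102, 71, 27]
subset = [38, 27, 23, 17, 11, 4]
rho_data, _ = spearmanr(all_counts, subset)

print(rho_1, rho_2, rho_data)
```

All three coefficients equal 1. Because only the ordering matters, the coefficient says nothing about whether the proportions match, which is what the question asks.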
