8

I have a vector of size 1x3500, which can be viewed as the 'known distribution'. It is simply a table of counts across 3500 groups (i.e. a contingency table). I also have $N$ other vectors of the same size. I want to compare each of these vectors to the 'known distribution' and see whether there is a statistically significant deviation.

The traditional approach would be to use a $\chi^2$-test. However, some of these tables are highly sparse (i.e. have many cells with zero counts), which violates the assumptions behind the test.

For those interested, the context is to test some hypotheses about my data, which is distributed county-wise across the US (~ 3500 counties). I cannot reduce the dimensionality as that would ruin the analysis.

What are some methods that are appropriate for doing a goodness of fit test on such high-dimensional data with a high degree of sparsity?

Edit: Everything else being equal, I would prefer an easily implementable solution. Python (or perhaps R) would be great.

pir

3 Answers

3

I assume that you want to test whether the number of occurrences is independent of the county. In that case, you can try the following:

First, transform your $1\times 3500$ tables into $2\times 3500$ tables (as in this post). Let $T$ be one of the $2\times 3500$ tables you have observed. A test of independence that does not rely on large-sample assumptions is Fisher's exact test. I don't know how familiar you are with the framework of this test, but roughly speaking it requires estimating the probability of all $2\times 3500$ tables that have the same row and column sums as $T$ but a larger $\chi^2$-statistic. For two-rowed tables, there exist efficient algorithms that approximate this $p$-value (have a look at Theorem 4 in this paper). Hope that helps.
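In Python, the stacking step could look like this (toy-sized vectors stand in for the real $1\times 3500$ counts; all names are illustrative):

```python
import numpy as np

# Illustrative stand-ins for the real 1 x 3500 count vectors:
# 'known' is the reference distribution, 'obs' is one of the N vectors.
known = np.array([5, 0, 12, 3])
obs = np.array([4, 1, 9, 6])

# Stack them as the two rows of a 2 x k contingency table T.
T = np.vstack([known, obs])
print(T.shape)  # (2, 4) here, (2, 3500) for the real data
```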

EDIT: Since it was not clear that my answer describes an approximate method rather than an exact one, I will extend my answer.

Again, let $T$ be one of your observed tables and let $n$ be its sample size (that is, the sum of all its entries). Let $\theta\in[0,1]^{2\times 3500}$ be the maximum-likelihood estimate of the cell probabilities under the independence model, and define for a table $v\in\mathbb{N}^{2\times 3500}$ the $\chi^2$ statistic as

$$\chi^2(v)=\sum_{i=1}^2\sum_{j=1}^{3500}\frac{(v_{ij}-n\cdot\theta_{ij})^2}{n\cdot\theta_{ij}}.$$

Let $\mathcal{F}(T)\subset\mathbb{N}^{2\times 3500}$ be the set of all tables that have the same row and column sums as $T$. The conditional $p$-value of Fisher's exact test is then

$$\frac{\sum_{v\in\mathcal{F}(T),\, \chi^2(v)\ge\chi^2(T)} \frac{1}{\prod_{i=1}^2\prod_{j=1}^{3500}v_{ij}!} }{\sum_{v\in\mathcal{F}(T)} \frac{1}{\prod_{i=1}^2\prod_{j=1}^{3500}v_{ij}!}}$$

Of course, this value is infeasible to compute exactly, since the size of $\mathcal{F}(T)$ is humongous. However, it can be approximated efficiently with the following adapted version of the algorithm in this paper. For an observed table $T$, do the following:

  1. Initialize with $i=0$, $w=T$
  2. $i=i+1$
  3. get another table $w'$ from $\mathcal{F}(T)$ by applying one step of the algorithm in this paper (Section 4)
  4. With probability $\min\left\{1,\frac{\prod_{ij}w_{ij}!}{\prod_{ij}w'_{ij}!}\right\}$, set $w:=w'$; otherwise leave $w$ untouched (this is the Metropolis-Hastings rejection step)
  5. If $\chi^2(w)\ge\chi^2(T)$, then $p_i:=1$, otherwise $p_i=0$
  6. If $|\frac{1}{i-1}\sum_{k=1}^{i-1}p_k-\frac{1}{i}\sum_{k=1}^{i}p_k|>\mathrm{tol}$, go to step (2)
  7. Return $\frac{1}{i}\sum_{k=1}^ip_k$
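A minimal Python sketch of steps (1)–(7) follows. Two caveats: the proposal in step (3) here uses the classical 2x2 "basic moves" as a stand-in, since the larger moves from Section 4 of the paper are not reproduced, and it runs a fixed number of iterations instead of the tolerance-based stopping rule of step (6). The function name is mine.

```python
import math
import numpy as np

def mh_fisher_p(T, n_steps=20000, rng=None):
    """Approximate the conditional p-value of Fisher's exact test for a
    2 x k table T by Metropolis-Hastings sampling from the distribution
    proportional to 1/prod(v_ij!) on F(T), the tables with T's margins.

    Caveat: the proposal below uses the classical 'basic moves'
    (+1/-1 on a random 2 x 2 subtable), not the larger moves of the
    paper's Section 4, so it mixes more slowly on sparse tables.
    """
    rng = np.random.default_rng(rng)
    T = np.asarray(T, dtype=np.int64)
    T = T[:, T.sum(axis=0) > 0]          # all-zero columns never change
    k = T.shape[1]
    n = T.sum()
    # expected counts n * theta under the independence model
    exp = np.outer(T.sum(axis=1), T.sum(axis=0)) / n

    def chi2(v):
        return ((v - exp) ** 2 / exp).sum()

    chi2_T = chi2(T)
    w = T.copy()
    hits = 0
    for _ in range(n_steps):
        # step (3): propose w' via a basic move on two random columns j, l
        j, l = rng.choice(k, size=2, replace=False)
        s = rng.choice([-1, 1])
        w_new = w.copy()
        w_new[0, j] += s; w_new[1, j] -= s
        w_new[0, l] -= s; w_new[1, l] += s
        if (w_new >= 0).all():
            # step (4): MH acceptance for the target 1/prod(v_ij!),
            # computed in log-space; only the four touched cells matter
            log_r = sum(math.lgamma(w[i, c] + 1) - math.lgamma(w_new[i, c] + 1)
                        for i in (0, 1) for c in (j, l))
            if rng.random() < math.exp(min(0.0, log_r)):
                w = w_new
        # step (5): record whether the sampled table is at least as extreme
        hits += chi2(w) >= chi2_T
    return hits / n_steps
```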

The output is an estimation of the conditional $p$-value.

Tobias Windisch
  • Sounds interesting. I thought that Fisher's exact test was infeasible for a table of this size. Are any of those algorithms implemented in R or Python? – pir Apr 19 '16 at 15:54
  • @pir: I don't know of any implementations, but their algorithm (Section 4 of their paper) seems to be fairly easy to implement (however, they sample the contingency tables uniformly, so you have to overlay the Metropolis-Hastings algorithm to sample from the right distribution). – Tobias Windisch Apr 19 '16 at 16:21
  • Thanks for the added information. It's not clear to me how this is different from using `fisher.test` in R with `simulate.p.value=TRUE`. Could you expand briefly on that? – pir Apr 22 '16 at 07:49
  • @pir: `fisher.test` only uses the short _basic moves_ to explore $\mathcal{F}(T)$. The adapted version however uses larger moves and explores $\mathcal{F}(T)$ more rapidly (the methods differ in Step 3). – Tobias Windisch Apr 22 '16 at 07:51
3

Instead of an exact test, you can do a $\chi^2$-test based on Monte Carlo simulation. You would build $N$ $2\times 3500$ tables, each consisting of your "known distribution" vector and one of your $N$ other vectors, and run $N$ tests.

Each test essentially simulates a large number of tables under the null hypothesis that the two distributions are the same, and then compares the observed table with the simulated ones.

The paper behind the method can be found here: http://www.jstor.org/stable/2984263?seq=1#page_scan_tab_contents

If you use R, the information on the code to do this is available here: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/chisq.test.html

You would set `simulate.p.value = TRUE`.

The limitation is that the table you feed to the function cannot contain rows or columns consisting only of 0's. Also, unless you set a seed, each run of the simulation will give different p-values.
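Since a ready-made Python implementation seems hard to find, the Monte Carlo idea is easy to sketch by hand. The version below conditions on both margins by randomly permuting column labels; note that R's `chisq.test` instead generates the tables with Patefield's (1981) algorithm. The function name is illustrative:

```python
import numpy as np

def mc_chi2_p(table, n_sim=999, rng=None):
    """Monte Carlo chi-squared test for a 2 x k contingency table, in the
    spirit of R's chisq.test(..., simulate.p.value = TRUE): simulate
    tables with the observed margins and compare their chi-squared
    statistics with the observed one.

    Null tables are drawn by randomly permuting column labels, which
    conditions on both margins (R uses Patefield's 1981 algorithm).
    """
    rng = np.random.default_rng(rng)
    tab = np.asarray(table, dtype=np.int64)
    tab = tab[:, tab.sum(axis=0) > 0]      # drop all-zero columns up front
    k = tab.shape[1]
    col = tab.sum(axis=0)
    r0 = tab.sum(axis=1)[0]                # first row total
    exp = np.outer(tab.sum(axis=1), col) / tab.sum()

    def stat(t):
        return ((t - exp) ** 2 / exp).sum()

    obs = stat(tab)
    labels = np.repeat(np.arange(k), col)  # one column label per observation
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(labels)
        # first r0 shuffled labels form row 0 of the simulated table
        top = np.bincount(labels[:r0], minlength=k)
        hits += stat(np.vstack([top, col - top])) >= obs
    # add-one correction, as in R's simulated p-value
    return (hits + 1) / (n_sim + 1)
```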

Sheep
  • That sounds perfect. Seeing as my 'known distribution' is not sparse it should not be an issue with the 0's. Do you happen to know if there is a Python implementation? – pir Apr 20 '16 at 08:47
  • Are you certain that the Monte Carlo simulation is robust to the sparsity issue? That is not clear to me from the paper. – pir Apr 20 '16 at 08:52
  • Isn't sampling tables according to the null hypothesis exactly the sampling procedure I have described in [my answer](http://stats.stackexchange.com/questions/207728/goodness-of-fit-test-on-sparse-contigency-tables-with-high-dimensionality/208157#208157)? It would be interesting to see where the difference is between these methods, because both methods try to estimate a quantity by sampling from the same set of contingency tables with MCMC methods. – Tobias Windisch Apr 20 '16 at 09:27
  • @pir The Monte Carlo test is used to approximate exact p-values according to this page here: https://onlinecourses.science.psu.edu/stat504/node/89 And exact tests are typically preferred for sparse tables. – Sheep Apr 20 '16 at 14:40
  • @Tobias Windisch Your answer didn't stress approximation by simulation; it seemed like you were suggesting a Fisher exact test, and I apologize if I misread your answer. It is not obvious to me that the exact procedures for simulating the tables are the same, but I suppose the general idea is the same. If you look at the page for the chisq.test function I suggested, I believe they use Patefield's method (1981) to generate the tables; I can't seem to find the actual paper. – Sheep Apr 20 '16 at 16:11
  • @pir I can't seem to find a Python implementation. – Sheep Apr 20 '16 at 16:15
  • It's difficult for me to gauge how similar the two answers are, but it's very valuable to have an already implemented method. – pir Apr 20 '16 at 18:22
  • Wouldn't it make more sense to use `fisher.test` with `simulate.p.value=TRUE` than the Chi-squared test? – pir Apr 22 '16 at 07:48
  • @pir It probably would; I wasn't aware of the `fisher.test` function. – Sheep Apr 22 '16 at 13:44
0

Hmm... wouldn't you use something like the Gini coefficient (or one of its related measures)? From my understanding, they were developed for exactly this kind of situation.
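For what it's worth, the Gini coefficient of a count vector is easy to compute; a minimal sketch (the function name is mine):

```python
import numpy as np

def gini(counts):
    """Gini coefficient of a nonnegative count vector: 0 for perfectly
    even counts, approaching 1 as the counts concentrate in few groups."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    # rank-based formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    return 2 * np.sum(np.arange(1, n + 1) * x) / (n * x.sum()) - (n + 1) / n
```

Note, though, that the Gini coefficient measures the concentration of a single vector, so it would summarize each county vector on its own rather than test it against the known distribution.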