
I have two finite-sampled signals, $x_1$ and $x_2$, and I want to check for statistical independence.

I know that for two statistically independent signals, their joint probability distribution is a product of the two marginal distributions.

I have been advised to use histograms in order to approximate the distributions. Here's a small example.

x1 = rand(1, 50);        % 50 samples from a uniform distribution on [0, 1]
x2 = randn(1, 50);       % 50 samples from a standard normal distribution
n1 = hist(x1);           % bin counts of x1 (10 bins by default)
n2 = hist(x2);           % bin counts of x2 (10 bins by default)
n3 = hist3([x1' x2']);   % 2D bin counts of the pairs (x1(i), x2(i)); requires the Statistics Toolbox

Since I am using the default number of bins, n1 and n2 are 10-element vectors, and n3 is a 10x10 matrix.

My question is this: How do I check whether n3 is in fact a product of n1 and n2?

Do I use an outer product? And if I do, should I use n1'*n2 or n1*n2'? And why?

Also, I have noticed that hist returns the number of elements (the frequency) in each bin. Should this be normalized in any way? (I haven't exactly understood how hist3 works either.)

Thank you very much for your help. I'm really new to statistics so some explanatory answers would really help.

Andre Silva
Rachel
  • A test of correlation between $X_1$ and $X_2$ will give the check you want; if $X_1$ and $X_2$ are independent, their correlation is zero, and e.g. `cor.test()` in R will give an appropriate test; I'm sure there are Matlab commands to do the same (a Matlab sketch follows these comments). For the plot, one simple approach is to plot several histograms of $X_1$, each using only data where $X_2$ lies in some specified range. If those histograms look different, this suggests dependence. Alternatively, just scatterplot $X_1$ vs $X_2$ and look for a trend; increasing or decreasing ones are easiest to spot. – guest Mar 12 '12 at 00:50
  • @guest Testing for correlation is easy, but independence places many more restrictions on the signals. For independence, pdf(x, y) = pdf(x) pdf(y). Now in this case, I'm using histograms to approximate the pdf, and would like to know what it actually means to compare the 10x10 matrix to the two 10-element vectors. – Rachel Mar 12 '12 at 09:33
  • I really think you'd be better off testing correlation, and making a scatterplot. Under the null there is no correlation, so it's a valid test. However, you can use the 10x10 matrix as the input to a Pearson Chi-squared test (`chisq.test()` in R) of independence; the null hypothesis being tested is that the joint distribution of the cell counts in your 2-dimensional contingency table is the product of the row and column marginals. – guest Mar 12 '12 at 16:48
  • Like I said in another comment, I'd opened up [another question](http://stats.stackexchange.com/questions/24439/how-can-i-perform-a-chi-square-test-for-independence-on-signal-samples) asking how to perform a Chi-Square Test on signal samples. It still hasn't been answered so you can go ahead and do that (if you want)! I would particularly like to know how using the 2D-histogram as the Contingency Table would mean that the null hypothesis being tested is that the joint distribution is the product of the marginals. Thank you. – Rachel Mar 12 '12 at 19:25
  • The 10x10 matrix of counts in each cell are the input you would use to the Pearson $\chi^2$ test. They form a contingency table. The null hypothesis for the Pearson $\chi^2$ test can be (re)stated as saying that the distribution of entries in all columns is the same, regardless of which row you consider; this is equivalent to your definition in terms of products of marginals. (Another definition reverses "rows" and "columns" in what I wrote above). One Matlab coding of the test is in NAG toolbox (http://www.nag.co.uk/numeric/MB/manual_22_1/pdf/G11/g11aa.pdf) – guest Mar 13 '12 at 00:01
  • @guest a $\chi^2$ test for contingency tables is indeed the way to test independence between two (categorical) variables. The null hypothesis of that test is that both variables are independent, which is exactly what the OP wants to test. –  Oct 24 '16 at 05:16
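As a concrete starting point for the correlation check suggested in the first comment (this sketch is editorial, not part of the original thread): in Matlab, `corrcoef` returns both the correlation matrix and p-values for testing the hypothesis of zero correlation. Keep in mind that zero correlation is necessary but not sufficient for independence.

x1 = rand(1, 50);             % the two signals from the question
x2 = randn(1, 50);
[R, P] = corrcoef(x1, x2);    % R: correlation matrix, P: p-values for H0: correlation = 0
r = R(1, 2);                  % sample correlation between x1 and x2
p = P(1, 2);                  % small p suggests a nonzero correlation (hence dependence)
scatter(x1, x2);              % visual check: look for any trend or structure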

3 Answers


Assuming that the theoretical distributions of $x_1$ and $x_2$ are not known, a naive algorithm for determining independence would be as follows:

Define $x_{1,2}$ to be the set of all co-occurrences of values from $x_1$ and $x_2$. For example, if $x_1 = \{ 1, 2, 2 \}$ and $x_2 = \{ 3, 6, 5 \}$, the set of co-occurrences would be $\{(1,3), (1,6), (1,5), (2,3), (2,6), (2,5), (2,3), (2,6), (2,5)\}$.

  1. Estimate the probability density functions (PDFs) of $x_1$, $x_2$ and $x_{1,2}$, denoted $P_{x_1}$, $P_{x_2}$ and $P_{x_{1,2}}$.
  2. Compute the root-sum-of-squares error $y = \sqrt{\sum_{(y_1, y_2)} \left( P_{x_{1,2}}(y_1, y_2) - P_{x_1}(y_1)\, P_{x_2}(y_2) \right)^2}$, where the sum runs over the pairs $(y_1, y_2)$ in $x_{1,2}$.
  3. If $y$ is close to zero, this suggests that $x_1$ and $x_2$ are independent.

A simple way to estimate a PDF from a sample is to compute the sample's histogram and then normalize it so that the estimate integrates to 1. In practice, this means dividing the bin counts by the factor $h \cdot \mathrm{sum}(n)$, where $h$ is the bin width and $n$ is the vector of bin counts.
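In Matlab terms, a minimal sketch of this normalization might look like the following (variable names are illustrative, and `hist` is used with its default of 10 equal-width bins):

x1 = rand(1, 50);                  % sample to be binned
[n, centers] = hist(x1);           % bin counts and bin centers (10 bins by default)
h = centers(2) - centers(1);       % bin width
p1 = n / (h * sum(n));             % piecewise-constant PDF estimate
disp(sum(p1) * h)                  % sanity check: the estimate integrates to 1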

Note that step 3 of this algorithm requires the user to specify a threshold for deciding whether the signals are independent.
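Putting the three steps together, here is a hedged Matlab sketch of the procedure. It deviates from the description above in one respect: the joint histogram is built from the paired observations $(x_1(i), x_2(i))$ rather than from all co-occurrences, since the co-occurrence construction would reproduce the product of the marginals by design. `hist3` is assumed to come from the Statistics Toolbox, and the sample size, bin count and variable names are illustrative.

x1 = rand(1, 1000);
x2 = randn(1, 1000);
nb = 10;                                      % number of bins per dimension

[n1, c1] = hist(x1, nb);  h1 = c1(2) - c1(1); % 1-D histograms and bin widths
[n2, c2] = hist(x2, nb);  h2 = c2(2) - c2(1);
p1 = n1 / (h1 * sum(n1));                     % marginal PDF estimates (step 1)
p2 = n2 / (h2 * sum(n2));

n3  = hist3([x1' x2'], {c1, c2});             % joint histogram on the same bin centers
p12 = n3 / (h1 * h2 * sum(n3(:)));            % joint PDF estimate (step 1)

prodP = p1' * p2;                             % outer product of the marginals (nb x nb)
y = sqrt(sum((p12(:) - prodP(:)).^2));        % step 2: root-sum-of-squares difference
% step 3: declare independence if y falls below a user-chosen threshold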

naught101
nojka_kruva

If you are trying to do a test of independence, it's better to use well-developed statistics than to come up with a new one. For example, you can start with a Chi-squared test. Of course, visualizing the difference between the product of the marginals and the joint distribution will give you good insight, so I encourage you to compute it as well.
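As a concrete illustration (not from the original answer), here is a hedged Matlab sketch of a Pearson chi-squared test of independence applied to the binned counts, along the lines discussed in the comments below. `hist3` and `chi2cdf` are assumed to be available from the Statistics Toolbox; the sample size and bin counts are illustrative, and with only 50 samples in a 10 x 10 table the expected counts would be far too small for the test to be reliable.

x1 = rand(1, 500);                       % larger sample so expected counts are not tiny
x2 = randn(1, 500);

O = hist3([x1' x2'], [5 5]);             % observed counts (5x5 contingency table)
N = sum(O(:));
E = sum(O, 2) * sum(O, 1) / N;           % expected counts under independence
                                         % (outer product of row and column sums / N)
chi2stat = sum((O(:) - E(:)).^2 ./ E(:));
df = (size(O, 1) - 1) * (size(O, 2) - 1);
p = 1 - chi2cdf(chi2stat, df);           % small p-value is evidence against independence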

Memming
  • How do you perform a Chi-Squared Test for Independence on signal samples? I am confused as to how to come up with the required Contingency Table, since the test works on categorical data. I was already thinking of using a Chi-Squared Test and had in fact opened up [another question](http://stats.stackexchange.com/questions/24439/how-can-i-perform-a-chi-square-test-for-independence-on-signal-samples#comment44503_24439). – Rachel Mar 11 '12 at 18:15
  • Also, how do you compare the product of the marginals to the joint pdf? (This was the scope of my question, after all!) Thank you. – Rachel Mar 11 '12 at 22:45
  • @Rachel, you'd use the Chi-Squared test on the binned values. – D.W. Mar 12 '12 at 01:29
  • @D.W. So, for example, to use the Chi-Squared test on two signals, I'd first compute the 2D-histogram, which is a matrix. Let's say it's a 10 x 10 matrix (the number of bins can easily be changed). Then I'd just use the 10 x 10 matrix as the Contingency Table? – Rachel Mar 12 '12 at 09:14
  • @Rachel, in more detail, you use [Pearson's Chi-squared test of independence](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test#Test_of_independence) on that $10 \times 10$ contingency table. – D.W. Mar 12 '12 at 09:22
  • @D.W. Yes, thank you, I know how to perform the test but I was confused as to what to use as the Contingency Table. Since we have only two signals, I wasn't sure whether a 10 x 10 table makes sense. This is in fact the answer to my [other question](http://stats.stackexchange.com/questions/24439/how-can-i-perform-a-chi-square-test-for-independence-on-signal-samples). You can answer in some detail over there if you like, and I'll mark your answer as correct. Thank you. – Rachel Mar 12 '12 at 09:29
  • Note that if you are confident the data come in pairs such as $(x_1, y_1), (x_2, y_2), \dots$ *that come from a single process Z*, then the chi-squared test may not make sense, because this test is for *unpaired* data. – Quetzalcoatl Feb 02 '19 at 21:05

You could compare the joint empirical distribution function with the product of the marginal empirical distribution functions. For two samples $x=(x_1,\dots,x_{n_1})$ and $y=(y_1,\dots,y_{n_2})$, let $n=n_1+n_2$ and define the joint empirical distribution function
$$ \hat{F}_n(s,t) = \frac{1}{n} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} I_{[x_i,\infty)\times[y_j,\infty)}(s,t) \, . $$
The marginal empirical distribution functions of each sample are
$$ \hat{G}_{n_1}(s) = \frac{1}{n_1} \sum_{i=1}^{n_1} I_{[x_i,\infty)}(s) \quad \textrm{and} \quad \hat{H}_{n_2}(t) = \frac{1}{n_2} \sum_{j=1}^{n_2} I_{[y_j,\infty)}(t) \, . $$
The idea is to compare $\hat{F}_{n}(s,t)$ with the product $\hat{G}_{n_1}(s)\hat{H}_{n_2}(t)$ using some norm. For example, you could use
$$ T(x,y) = \sup_{s,t} \bigg\vert \hat{F}_{n}(s,t) - \hat{G}_{n_1}(s)\hat{H}_{n_2}(t) \bigg\vert \, . $$
If we knew the distribution of $T$ under the hypothesis of independence, we would have a way to compute a $p$-value for this problem. I don't know how this can be done.
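A hedged Matlab sketch of the statistic $T$ (not part of the original answer): it assumes the two signals are observed as pairs of equal length, evaluates the joint empirical distribution function on the paired sample, and approximates the supremum over the grid of observed values. One plausible route to a $p$-value, not verified here, is to recompute $T$ after randomly permuting one of the samples.

x = rand(1, 200);
y = x.^2 + 0.1 * randn(1, 200);       % a deliberately dependent pair of signals
n = numel(x);

T = 0;
for i = 1:n
    for j = 1:n
        s = x(i);  t = y(j);          % grid point (s, t) taken from the observed values
        F = mean(x <= s & y <= t);    % joint EDF of the paired sample at (s, t)
        G = mean(x <= s);             % marginal EDF of x
        H = mean(y <= t);             % marginal EDF of y
        T = max(T, abs(F - G * H));
    end
end
disp(T)   % large values of T suggest dependence; permuting y gives a reference
          % distribution under the independence hypothesis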

Zen
  • You're on the same track pursued by Gretton *et al* at http://eprints.pascal-network.org/archive/00004335/01/NIPS2007-Gretton_%5B0%5D.pdf. Also see their slides at http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/ismTalk_%5B0%5D.pdf. – whuber Mar 12 '12 at 17:29
  • I'm not sure how it relates mathematically, but this is basically trying to get at the same thing as [conditional mutual information](https://en.wikipedia.org/wiki/Mutual_information#Conditional_mutual_information), isn't it? – naught101 Oct 24 '16 at 04:28
  • Not the _conditional_ mutual information, just the mutual information. – Stuart Berg Jan 28 '22 at 13:52