
Performing cluster analysis, I have a reference clustering and the results of two methods A and B. I can calculate fitness metrics (like adjusted mutual information) between the reference and either result.

Please note that clustering results are not to be confused with classification results: in classification, 1,1,1,2,2,3,3 would be a different result than 1,1,1,3,3,2,2, but in clustering those two are identical. Also, there is no constraint that the methods return the same number of clusters as each other or as the reference.
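
For illustration, adjusted mutual information already treats such relabelings as identical; a minimal sketch with scikit-learn:

```python
from sklearn.metrics import adjusted_mutual_info_score

# Two labelings that differ only by renaming the cluster labels (2 <-> 3)
labels_1 = [1, 1, 1, 2, 2, 3, 3]
labels_2 = [1, 1, 1, 3, 3, 2, 2]

# AMI is invariant under label permutations, so these count as the same clustering
print(adjusted_mutual_info_score(labels_1, labels_2))  # prints 1.0
```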

I found that method A is better than method B, but I want to know whether it is statistically significantly better, using a significance test (G-test or chi-squared test). I have an idea, but I am not sure whether it is valid:

I consider the reference and the results of the two methods to be three variables (R, A, B) and build a three-way contingency table with cell frequencies $n_{r,a,b}$. I plan to perform a chi-squared test, for which I need an expected frequency for each cell:

$$ E_{r,a,b} = \underbrace{ \frac{ \sum_{b'=1}^{N_B} n_{r,a,b'} }{N} }_{\text{marginal probability } P(R=r,\,A=a)} \cdot \underbrace{ \frac{ \sum_{a'=1}^{N_A} n_{r,a',b} }{N} }_{\text{marginal probability } P(R=r,\,B=b)} \cdot N, \qquad N = \sum_{r=1}^{N_R}\sum_{a=1}^{N_A}\sum_{b=1}^{N_B} n_{r,a,b}. $$

Then I calculate

$$ G=2 \cdot \sum_{r=1}^{N_R}\sum_{a=1}^{N_A}\sum_{b=1}^{N_B} n_{r,a,b} \cdot \ln \left( \frac{n_{r,a,b}}{E_{r,a,b}} \right), $$

and use it in a chi-squared test with $(N_R-1) \cdot (N_A-1) \cdot (N_B-1)$ degrees of freedom.
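
In code, the computation I have in mind would look roughly like this — a minimal sketch in Python/NumPy with made-up labelings (SciPy is only used for the $\chi^2$ tail probability); whether this is a valid test is exactly what I am asking:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical cluster labelings of the same N objects (0-based integer labels)
r_labels = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 2])   # reference R
a_labels = np.array([0, 0, 1, 1, 2, 2, 1, 0, 1, 2])   # method A
b_labels = np.array([0, 1, 1, 2, 2, 0, 1, 0, 1, 2])   # method B

N = len(r_labels)
n_R, n_A, n_B = r_labels.max() + 1, a_labels.max() + 1, b_labels.max() + 1

# Three-way contingency table n[r, a, b]
n = np.zeros((n_R, n_A, n_B))
np.add.at(n, (r_labels, a_labels, b_labels), 1)

# Expected frequencies as proposed: E = N * P(R=r, A=a) * P(R=r, B=b)
p_RA = n.sum(axis=2) / N           # marginal P(R, A)
p_RB = n.sum(axis=1) / N           # marginal P(R, B)
E = N * p_RA[:, :, None] * p_RB[:, None, :]

# G statistic (cells with n = 0 contribute nothing to the sum)
mask = n > 0
G = 2 * np.sum(n[mask] * np.log(n[mask] / E[mask]))

# Degrees of freedom as proposed above
df = (n_R - 1) * (n_A - 1) * (n_B - 1)
p_value = chi2.sf(G, df)
print(G, df, p_value)
```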

Is my approach correct?


To visualize the basic problem, I drew a beautiful graphic, where the y-axis is the goodness of fit:

[Figure: goodness of fit on the y-axis, for the direct comparison of A with B and, at the right, "A compared to R" and "B compared to R", under slight variations of the dataset]

We can see that if we varied the situation slightly (e.g. by creating different versions of the dataset through added Gaussian noise), we might get an image like this. We could then perform a hypothesis test with the null hypothesis that the goodness of methods A and B is identical. We would see that the mean goodness of method B is far from the mean goodness of method A, taking the variance of the goodness of method A into account.

However, this is not without issue: if we want to know whether method A is significantly better, don't we have to take into account how good the methods are, in other words their absolute goodness? If we consider the difference to R (what I asked above), then we need to look at the two "lines" at the right. There we see that "A compared to R" is not that much better than "B compared to R". This is in contrast to the idea of the last paragraph, where we compared A to B directly (not considering the difference to R).


By the way, I think this is different from a test of conditional independence, because there we divide by $\sum_{a=1}^{N_A}\sum_{b=1}^{N_B} n_{r,a,b}$ rather than by $N$ when calculating $E_{r,a,b}$. My approach amounts to taking the expected frequencies of the conditional-independence test and weighting each of them by the marginal probability of R, as written out below.
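
Written out with marginal sums, the conditional-independence test would use the expected frequencies

$$ E^{\mathrm{CI}}_{r,a,b} = \frac{n_{r,a,+}\, n_{r,+,b}}{n_{r,+,+}}, \qquad n_{r,a,+} = \sum_{b'=1}^{N_B} n_{r,a,b'}, \quad n_{r,+,b} = \sum_{a'=1}^{N_A} n_{r,a',b}, \quad n_{r,+,+} = \sum_{a'=1}^{N_A}\sum_{b'=1}^{N_B} n_{r,a',b'}, $$

whereas my expected frequencies from above reduce to $E_{r,a,b} = n_{r,a,+}\, n_{r,+,b} / N = E^{\mathrm{CI}}_{r,a,b} \cdot n_{r,+,+} / N$, i.e. the conditional-independence expectation weighted by the (estimated) marginal probability of $R = r$.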

Make42
    As your A, B, R seem to refer to the same data set, chances are they are all dependent. No test that assumes them independent will work. I don't think any test will do what you want in a valid way. A is better than B for the given data set, but if you want to have a valid test, you need to compare them over different data sets, so that you can assess the variation in the measurements that compare them. Any random model that you can apply on outcomes from a single data set will need to assume some independence that in all likelihood will be violated. – Christian Hennig Feb 05 '22 at 00:42
  • @ChristianHennig: Yes, that would be a good approach; however, in some situations I only have this one dataset, or results vary considerably between datasets. Well, I know R, A, B are dependent, you are right. But the question is how much? Because method A is closer to R, according to adjusted mutual information, than method B, I know that method A is more dependent on R than B is on R. But how dependent are the distributions P(R,A) and P(R,B)? That is what I would like to know. Assuming independence is just a way to build my chi-squared test. – Make42 Feb 05 '22 at 10:22
  • I edited my answer to address your comments and your diagram. – EdM Feb 06 '22 at 17:40
  • @Make42 In order to specify your null hypothesis completely, you need to define a (potentially nonparametric) probability model. You state "Goodness of the methods A and B are identical", but over what "population of data sets"? (Obviously, on your one data set oberved, A is better, so identity will not hold over arbitrary populations.) – Christian Hennig Feb 08 '22 at 00:41
  • @ChristianHennig: What I am not understanding is: When I do a regular Chi-squared test from a contingency table, I also test if two distributions $P(X)$ and $P(Y)$ are independent: I test the joint distribution $P(X,Y)$ against the expected "independent" distribution $P(X) \cdot P(Y)$. I also only have one distribution for each random variable. I do not resample multiple $P(X_1)$ ... $P(X_{N_X})$. Here, instead of P(X) and P(Y), I have P(R,A) and P(R,B) - I just made it multivariate. Yes, P(R,A) and P(R,B) are dependent, but when I reject the independence hypothesis, so are P(X) and P(Y). – Make42 Feb 08 '22 at 10:24
  • 1. I do not see how this is different from my situation, once you forget the origin of clustering. 2. Why am I not wrongly assuming independence in the chi-squared test? – Make42 Feb 08 '22 at 10:26
  • @Make42 There are two "independence assumptions" that are important here. The $\chi^2$-test tests independence between an $X$ and a $Y$, so this is "assumed", but only for the null hypothesis, and this is not the problem in your case. However, it also assumes (and not only for the null hypothesis) that the individually counted events on which the $n_{ij}$ in the test statistic are based are independent. This is not the case here, as the classifications of the different observations are dependent on each other. – Christian Hennig Feb 08 '22 at 10:38
  • @ChristianHennig: Let me see if I got right what you are telling me: In the case of a $\chi^2$-test of two random variables $X$ and $Y$, I am allowed to assume independence between $P(X)$ and $P(Y)$ (the marginals of the contingency table) when calculating the expected values of the contingency table of the null hypothesis. But what I am not allowed to do is to assume that the frequencies of the cells $n_{ij}$ are independent. Is that right? – Make42 Feb 08 '22 at 14:28
  • Transferring this to my case: 1. I am allowed to assume for the calculation of the expected values (null hypothesis) independence between the marginal distributions $P(R), P(A), P(B)$. 2. I should also be allowed to assume independence between $P(R,A)$ and $P(R,B)$ for the null hypothesis (i.e. the calculation of the expected frequencies), since you said that this is not the issue. 3. Is your critique, that I am wrongly assuming independence between cells $n_{r,a,b}$ of the cube; e.g., $n_{1,2,3}$ and $n_{1,3,2}$? I do not see where I assume this differently than for the X, Y case before. – Make42 Feb 08 '22 at 14:38
  • @Make42 "But what I am not allowed to do, is to assume that the frequencies of the cells $n_ij$ and are independent. Is that right?" No. What you need to assume is the independence of all the individual events that are counted to make up the $n_{ij}$. This is for sure not fulfilled here. – Christian Hennig Feb 08 '22 at 14:58
  • @ChristianHennig: What are the individual events in the case of two random variables $X$, $Y$ with $P(X)$, $P(Y)$ that need to be independent and what is the individual event in the case of three random variables? – Make42 Feb 08 '22 at 20:20
  • Also, the first test for conditional independence in https://online.stat.psu.edu/stat504/lesson/5/5.3/5.3.4 looks very similar to the formula I am proposing. I try to explain this in the last paragraph of my question. – Make42 Feb 08 '22 at 20:26
  • @Make42 $n_{r,a,b}$ is the count of observations that have $R=r, A=a, B=b$. The assumption is that there's a probability, say $\pi_{r,a,b}$, for an observation to belong to this cell (the page linked in your last comment uses this $\pi$-notation as well), and observations are i.i.d. (particularly independently) assigned to cells according to these probabilities. This is for sure not the case here. – Christian Hennig Feb 08 '22 at 20:39
  • @ChristianHennig: An observation would be "For a single data object: to which cluster is it going to be assigned?" - right? – Make42 Feb 08 '22 at 20:48
  • Yes, that's right. – Christian Hennig Feb 08 '22 at 21:07
  • @ChristianHennig: So the problem is not whether or not the random variables are dependent on each other, but whether the observations of a single random variable are independent of each other. In my case they are not (which has nothing to do with the fact that I am interested in two methods A and B!), because two objects are clustered together *because* they are dependent. So the problem is not due to my question of comparing two results (three, counting the reference), but lies in the way clustering works in the first place, destroying the legitimacy of any $\chi^2$-style test, right? – Make42 Feb 08 '22 at 21:32
  • Well, this *and* the fact that it isn't informative to reject independence of A vs. R against B vs. R because they are dependent through the same data set on which they are applied anyway. I really suspect you try to impose a significance test here for a problem that isn't the job of a significance test (see also my discussion of bootstrap with @EdM). If you only look at a single data set, your effective sample size for comparing A and B is 1. Not enough for a test. – Christian Hennig Feb 08 '22 at 21:47
  • @ChristianHennig: I think I am starting to understand. I am going down this rabbit hole because the reviewers of my paper asked whether method A is *significantly* better than method B, where I just showed it is better on average. And then my supervisor suggested that I should *not* repeat the experiments, but instead should look for a significance test on contingency matrices that considers each data object as a single observation (not the entire clustering result as an observation, as bootstrapping would do). That is how I got here in the first place. – Make42 Feb 09 '22 at 16:27
  • Finally, I did not realize that, to test independence of two random variables with something like a $\chi^2$-test, I need to be sure that the observations within a random variable are independent. Now I start to question whether bootstrapping is valid: if a sample (= an observation) is gained via resampling from the same base dataset, aren't the samples dependent, because they come from the same base dataset? Or are they independent after all, because that is just "drawing from the distribution" and the observations are allowed to be dependent via the distribution? – Make42 Feb 09 '22 at 16:32
  • Sorry, I'm going to stop at this point, there's other stuff to do. – Christian Hennig Feb 09 '22 at 16:44
  • @ChristianHennig: Sure, no problem. – Make42 Feb 10 '22 at 15:13

1 Answer


What you know and don't (yet) know

I found out that method A is better than method B, but I want to know whether it is statistically significantly better using a significance test...

You know that A is better than B when applied to your particular data set. What you don't know is whether that superiority is more than might be expected by chance or that it will continue to hold on new data samples. That is what a significance test can help to evaluate. Refrain from saying things like "I know that A is better" until after you have supported that statement with some kind of significance test.

Below is a brief explanation of why your method is incorrect for this application. Following that are suggestions for two types of such tests, one based on statistics from the data sample and the second based on resampling the data.

Lack of independence, and why your proposal seems to be incorrect

As Christian Hennig said in a comment, "As your A, B, R seem to refer to the same data set, chances are they are all dependent. No test that assumes them independent will work."

The lack of independence comes from the fact that R, A, and B all are based on the same $N$ data points. Proper analysis must take the matching on the same data points into account.

This distinction between matched and unmatched data is most often seen in the choice between the $\chi^2$ test and McNemar's test. The importance of that distinction can be hard to grasp, but it is crucial; study that matter carefully. I can't do better than the two different explanations on that page by @gung-ReinstateMonica.

The problem with your proposed test is that it is an extension of a classical $\chi^2$ test, whereas an extension of McNemar's test is needed to handle the repeated evaluations on the same $N$ data points.
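
Purely to illustrate the matched-pairs mechanics, here is a minimal sketch in Python with statsmodels. The per-object agreement indicators are hypothetical (in the simpler setting where, e.g. after matching cluster labels to the reference, each object either is or is not assigned consistently with R by each method); McNemar's test then works on the discordant counts:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical per-object indicators: does each method place the object
# consistently with the reference R?
agree_A = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1], dtype=bool)
agree_B = np.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 0], dtype=bool)

# 2x2 table of paired outcomes on the same N objects
table = np.array([
    [np.sum( agree_A &  agree_B), np.sum( agree_A & ~agree_B)],
    [np.sum(~agree_A &  agree_B), np.sum(~agree_A & ~agree_B)],
])

# McNemar's test uses only the discordant cells (A agrees / B doesn't, and vice versa)
result = mcnemar(table, exact=True)
print(result.statistic, result.pvalue)
```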

A possibility suggested by your diagram

Your diagram suggests that you want a single measure to evaluate how close each of A and B come to R and, to evaluate the "statistical significance," how much better A is than B in that respect. That is addressed by measures of inter-rater agreement.

In your situation, with a common reference R and two alternate methods A and B, the agreement of A with R and the agreement of B with R could both be evaluated with Cohen's $\kappa$, a measure of agreement between two raters on the same $N$ data points. That gives a measure of agreement of each of A and B with R, providing a way to take the "absolute values of the goodness of fit" into account, as you say in a comment.

Furthermore, software that estimates $\kappa$ can also report a large-sample asymptotic normal approximation to its variance. That gives a way to evaluate with statistics from your data set whether A is "significantly" better than B with respect to agreement with R: is $\kappa_{RA}$ significantly different from $\kappa_{RB}$ when the variances of the estimates are taken into account?
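
A rough sketch of that comparison in Python with statsmodels and SciPy: the Hungarian matching of cluster labels to the reference, the example labelings, and the naive z comparison (which ignores the covariance between the two $\kappa$ estimates, since they share the reference and the data points) are my illustrative choices, not a prescribed recipe.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from statsmodels.stats.inter_rater import cohens_kappa

# Hypothetical integer cluster labels of the same N objects
r_labels = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 2])   # reference R
a_labels = np.array([0, 0, 1, 1, 2, 2, 1, 0, 1, 2])   # method A
b_labels = np.array([0, 1, 1, 2, 2, 0, 1, 0, 1, 2])   # method B

def align_to_reference(ref, labels):
    """Relabel `labels` to best match `ref` (Hungarian matching on the
    cross-tabulation), since cluster labels are arbitrary."""
    k = int(max(ref.max(), labels.max())) + 1
    cont = np.zeros((k, k))
    np.add.at(cont, (labels, ref), 1)
    rows, cols = linear_sum_assignment(-cont)          # maximize the overlap
    mapping = dict(zip(rows, cols))
    return np.array([mapping[l] for l in labels])

def kappa_vs_reference(ref, labels):
    aligned = align_to_reference(ref, labels)
    k = int(max(ref.max(), aligned.max())) + 1
    table = np.zeros((k, k))
    np.add.at(table, (ref, aligned), 1)
    return cohens_kappa(table)      # results object with .kappa and .var_kappa

res_A = kappa_vs_reference(r_labels, a_labels)
res_B = kappa_vs_reference(r_labels, b_labels)

# Naive z comparison of kappa_RA and kappa_RB; note that it ignores the
# covariance between the two estimates, which share R and the data points
z = (res_A.kappa - res_B.kappa) / np.sqrt(res_A.var_kappa + res_B.var_kappa)
print(res_A.kappa, res_B.kappa, z)
```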

A potentially better approach: resampling

Although you might only have 1 data set in hand, you still can effectively "compare the methods over different data sets," as Christian Hennig recommends, by resampling the data you have.

Applying your modeling methods to multiple resamples, e.g. with bootstrapping, allows more robust evaluation of the methods. On this site, this page, this page, and this page discuss resampling applied to cluster analysis. This R-bloggers post on Bootstrap Evaluation of Clusters and this paper on Evaluation of confidence limit estimates of cluster analysis on molecular marker data should provide additional ideas.

A comment suggests that you are concerned about some sort of "conceptual variance" from resampling:

In the case of bootstrapping it is the number of objects I resample or - alternatively - how often a single original object is allowed to be re-drawn.

With a standard bootstrap you do not have the flexibility to introduce such a "conceptual variance." The process mimics as closely as possible your sampling of the original $N$ data points from the underlying population.

To evaluate with standard bootstrapping the reliability of estimates based on your original $N$ data points, each resample from your data set contains exactly $N$ data points. To ensure independence among resamples, you resample with replacement. Bootstrapping thus places no limit on "how often a single original object is allowed to be re-drawn." With that specific approach, a large literature documents the validity of bootstrapping in many situations where you can't make assumptions about the forms of underlying probability distributions.

You have an advantage versus unsupervised clustering, as you have a known reference R. By the bootstrap principle, the process of taking bootstrap samples from your original data set mimics taking your original data set from the underlying population. You evaluate the ability of your modeling methods A and B applied to multiple bootstrap samples to work against the reference R in the full data set. That provides estimates of how methods A and B applied to your full data set would work on new samples from the underlying population.

With enough resamples you can get reasonable point estimates and confidence intervals for any adequately well behaved measure of model performance (even something so simple as how frequently A outperforms B when developed on the same bootstrap sample and evaluated on the full data). Validation by resampling also allows you to evaluate how often each of the methods returns the correct number of clusters, something that your proposal doesn't accomplish.
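
A minimal sketch of that workflow (the synthetic data, the reference from make_blobs, and the two K-means stand-ins for your methods A and B are placeholders; you would plug in your own methods, data, and reference R):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)

# Placeholder data and reference clustering R of the same N objects
X, r_labels = make_blobs(n_samples=200, centers=3, random_state=0)

# Hypothetical stand-ins for the two clustering methods
def method_A(X_train, X_full):
    return KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train).predict(X_full)

def method_B(X_train, X_full):
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train).predict(X_full)

n_boot = 200
ami_A, ami_B = [], []
for _ in range(n_boot):
    idx = rng.integers(0, len(X), size=len(X))   # N points, drawn with replacement
    boot_X = X[idx]
    # Develop each method on the bootstrap sample, evaluate against R on the full data
    ami_A.append(adjusted_mutual_info_score(r_labels, method_A(boot_X, X)))
    ami_B.append(adjusted_mutual_info_score(r_labels, method_B(boot_X, X)))

diff = np.array(ami_A) - np.array(ami_B)
print("mean AMI difference (A - B):", diff.mean())
print("95% bootstrap interval:", np.percentile(diff, [2.5, 97.5]))
print("fraction of resamples where A beats B:", (diff > 0).mean())
```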

EdM
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackexchange.com/rooms/133986/discussion-on-answer-by-edm-significance-test-of-improvement-of-clustering-metho). – Sycorax Feb 09 '22 at 13:35