In a cluster analysis, I have a reference clustering and the results of two methods, A and B. I can calculate agreement metrics (such as adjusted mutual information) between the reference and either result, and I can also test whether the reference and either result are independent, using the G-test or the chi-squared test. The G-test can be formulated using mutual information: $G = 2N \cdot I$, where $N$ is the number of samples and $I$ is the empirical mutual information in nats.
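To make the relation between the G-test and mutual information concrete, here is a minimal sketch using only the Python standard library. The label vectors `reference` and `result_a` are made-up toy data, and the function names are my own, not taken from any library.

```python
import math
from collections import Counter

def mutual_information(ref, res):
    """Empirical mutual information in nats between two label vectors."""
    n = len(ref)
    joint = Counter(zip(ref, res))
    row, col = Counter(ref), Counter(res)
    return sum((c / n) * math.log(c * n / (row[r] * col[k]))
               for (r, k), c in joint.items())

def g_statistic(ref, res):
    """G = 2 * sum(O * ln(O / E)) over non-empty cells; E comes from the marginals."""
    n = len(ref)
    joint = Counter(zip(ref, res))
    row, col = Counter(ref), Counter(res)
    return 2.0 * sum(c * math.log(c / (row[r] * col[k] / n))
                     for (r, k), c in joint.items())

# Hypothetical toy data: a reference clustering and the output of method A.
reference = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
result_a = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]

n = len(reference)
g = g_statistic(reference, result_a)
mi = mutual_information(reference, result_a)
print(g, 2 * n * mi)  # the two numbers agree up to rounding
```

Both routes give the same statistic, so anything I say about the G-test below can equally be phrased in terms of mutual information.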

Please note that clustering results are not to be confused with classification results: in classification, `1,1,1,2,2,3,3` would be a different result than `1,1,1,3,3,2,2`, but in clustering those are identical results, because the cluster labels themselves carry no meaning.
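A small sketch of what I mean, again with only the standard library. The `accuracy` and `mutual_information` helpers are my own illustrative functions: accuracy depends on the label names, whereas mutual information does not.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information in nats between two label vectors."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def accuracy(truth, predicted):
    """Fraction of positions where the label names match exactly."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

reference = [1, 1, 1, 2, 2, 3, 3]
relabeled = [1, 1, 1, 3, 3, 2, 2]  # same partition, labels 2 and 3 swapped

acc = accuracy(reference, relabeled)
mi_same = mutual_information(reference, reference)
mi_relabeled = mutual_information(reference, relabeled)
print(acc, mi_same, mi_relabeled)
```

Accuracy treats the relabeled vector as mostly wrong, while the mutual information with the reference is unchanged by the relabeling.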

I found that method A is better than method B, but I want to know whether it is statistically significantly better, using a significance test (G-test or chi-squared test). How can I test that? I would have two contingency tables: one comparing the reference with the results of method A, and one comparing the reference with the results of method B. My idea is to test whether these two tables differ significantly. If I had classes (as in classification), I could treat each cell of one contingency table as the expected value and the corresponding cell of the other table as the observed value, and perform the G-test. However, while the reference results are the same for both tables (e.g. the row marginal entropy), the results of the methods (e.g. the column marginal entropy) are not the same.
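The following sketch shows the mismatch on made-up data. I assume, purely for illustration, that the reference has 3 clusters while method A returns 4 and method B returns 5; `contingency_table` is my own helper.

```python
from collections import Counter

def contingency_table(ref, res):
    """Rows: reference clusters, columns: clusters of the method (sorted by label)."""
    rows, cols = sorted(set(ref)), sorted(set(res))
    counts = Counter(zip(ref, res))
    return [[counts[(r, c)] for c in cols] for r in rows]

# Hypothetical toy data.
reference = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
result_a = [1, 1, 1, 4, 2, 2, 2, 2, 3, 3, 3, 4]  # 4 clusters
result_b = [1, 1, 5, 5, 2, 2, 4, 4, 3, 3, 3, 5]  # 5 clusters

table_a = contingency_table(reference, result_a)
table_b = contingency_table(reference, result_b)

row_sums_a = [sum(row) for row in table_a]
row_sums_b = [sum(row) for row in table_b]
print(row_sums_a, row_sums_b)              # identical: both come from the reference
print(len(table_a[0]), len(table_b[0]))    # different numbers of columns
```

The row sums coincide, but the tables do not even have the same shape, and the column labels of one table have no correspondence to those of the other, so a cell-by-cell "observed versus expected" comparison is not defined.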

It is not clear which "observed" values should be compared with which "expected" values.

Maybe I need interaction information for all that, but I cannot figure out how.

Maybe information gain is an approach.

Another idea is to use the G-test to calculate the p-value for the null hypothesis that the result of method A (or B) is independent of the reference. Call these values $p_A$ and $p_B$. Since method A is "more not independent" of the reference than method B, $p_B > p_A$. One could then calculate $p_B - p_A$ and try to interpret it, but I am not sure how, or whether it is interpretable at all.
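Here is a minimal sketch of how $p_A$ and $p_B$ could be computed, assuming the usual asymptotic chi-squared reference distribution with $(R-1)(C-1)$ degrees of freedom. The data are made up, the helper names are my own, and `chi2_sf` is a hand-written upper-tail function (built from the recurrence of the regularized incomplete gamma function) so that the example needs nothing beyond the standard library. With so few samples the asymptotic p-values are rough at best.

```python
import math
from collections import Counter

def chi2_sf(x, dof):
    """Upper tail P(X > x) of a chi-squared distribution with integer dof >= 1."""
    if dof < 1:
        raise ValueError("dof must be at least 1")
    if x <= 0:
        return 1.0
    z = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-z)             # closed form for dof = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))  # closed form for dof = 1
    while a < dof / 2.0:                     # Q(a+1, z) = Q(a, z) + z^a e^-z / Gamma(a+1)
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

def g_test(ref, res):
    """Return (G, dof, p) for independence of two label vectors."""
    n = len(ref)
    joint = Counter(zip(ref, res))
    row, col = Counter(ref), Counter(res)
    g = 2.0 * sum(c * math.log(c * n / (row[r] * col[k]))
                  for (r, k), c in joint.items())
    dof = (len(row) - 1) * (len(col) - 1)
    return g, dof, chi2_sf(g, dof)

# Hypothetical toy data: A agrees closely with the reference, B barely at all.
reference = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
result_a = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
result_b = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]

g_a, dof_a, p_a = g_test(reference, result_a)
g_b, dof_b, p_b = g_test(reference, result_b)
print(p_a, p_b, p_b - p_a)
```

This reproduces $p_B > p_A$ on the toy data, but it only gives two separate tests against independence; it does not by itself turn $p_B - p_A$ into a test statistic, which is exactly the part I am unsure about.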

As one can see, I am somewhat in the dark about how to approach this.


I think, but am not sure, that what I am asking is different from https://online.stat.psu.edu/stat504/lesson/5 (and thus from https://stats.stackexchange.com/a/147980/83252). I think so because there they compare the whole three-way contingency table with the expected-value table of independent frequencies. I, however, want to compare only "two sides" of the three-way table with each other (so to speak).

If I get it right, the $\chi^2$ test for multidimensional data is about having a three-way table and then checking for mutual (complete) independence of the variables X, Y, Z. At least, this is what the example in stats.stackexchange.com/a/147980/83252 suggests. However, can my question be rephrased as asking whether P(X,Y) and P(X,Z) are independent? Maybe that is a path to an answer to my original question?
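To show what I mean by the "two sides" of the three-way table, here is a small sketch on made-up data, with X the reference, Y the result of method A and Z the result of method B. The `marginalize` helper is my own.

```python
from collections import Counter

# Hypothetical toy data.
reference = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]  # X
result_a = [1, 1, 1, 4, 2, 2, 2, 2, 3, 3, 3, 4]   # Y
result_b = [1, 1, 5, 5, 2, 2, 4, 4, 3, 3, 3, 5]   # Z

# Three-way contingency table: counts of (x, y, z) triples.
three_way = Counter(zip(reference, result_a, result_b))

def marginalize(table, keep):
    """Sum a sparse count table over all axes except those listed in `keep`."""
    out = Counter()
    for key, count in table.items():
        out[tuple(key[i] for i in keep)] += count
    return out

table_xy = marginalize(three_way, keep=(0, 1))  # reference vs. method A
table_xz = marginalize(three_way, keep=(0, 2))  # reference vs. method B
print(table_xy)
print(table_xz)
```

The two tables I want to compare are the X-Y and X-Z margins of the same three-way table, whereas the linked examples test the full table against complete independence.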

Make42
  • If you "have a reference for the results," how is this different from a classification problem? Please edit your question to show more details of the data you have, what you mean by the "reference for the results" that distinguishes this from a classification problem, the natures of methods A and B, and how they perform relative to that "reference." Please provide that information by editing the question, as comments are easy to overlook and can be deleted. – EdM Feb 03 '22 at 21:27
  • @EdM: I wrote "Please note that clustering results are not to be confused with classification results: in classification, `1,1,1,2,2,3,3` would be a different result than `1,1,1,3,3,2,2`, but in clustering those would be identical results." Does that help? – Make42 Feb 04 '22 at 10:28
  • If your "reference for the results" is some type of ground truth against which you can compare methods A and B, then that distinction just has to do with the names associated with the clusters/classes. It wouldn't affect tests of agreement. Is there perhaps some uncertainty in the cluster assignments within your "reference"? Are you specifying 3 clusters a priori in your methods, or are you also trying to estimate the number of clusters with the methods? – EdM Feb 04 '22 at 13:25
  • @Dave: I think not. If I get it right, the other question is about having a three-way table and then checking for [mutual (complete) independence](https://online.stat.psu.edu/stat504/lesson/5/5.3/5.3.1) of the variables X, Y, Z. At least, this is what the example in https://stats.stackexchange.com/a/147980/83252 suggests. However, can my question be rephrased as asking whether P(X,Y) and P(X,Z) are independent? Maybe that is a path to an answer to my original question? – Make42 Feb 04 '22 at 15:35
  • @EdM: The reference result, by its nature, specifies the number of clusters. In general, the reference cannot be considered to be "right", but I act as if it were. In either case, the reference result is not known to the methods, so it is very likely that, e.g., while the reference result specifies 3 clusters, method A will output 4 and method B will output 5 clusters. Also, having the "wrong naming" is no triviality: e.g., accuracy cannot be calculated without additional knowledge (meaning substantial manual work). – Make42 Feb 04 '22 at 15:42
  • You seem to have a very similar new question posted [here](https://stats.stackexchange.com/q/563108/28500). It doesn't, however, deal with the possibility that methods A and B won't even return the "correct" number of 3 clusters as you note in your comment above. Please decide which of these questions you want to keep open and close the other, or rewrite 1 of them so that the distinction between them is clear. – EdM Feb 04 '22 at 16:50
  • @EdM: Well, this one is an open question; the other one is a "yes or no" question. If someone answered "yes" on the other one, I could close this one. But if the answer there is "no", then I am none the wiser and the question here would still be unanswered. Not sure what to do now. Also, for both questions, there is no guarantee of any method returning the "correct" number of clusters. I did not mention it there, simply because there it is *not* a constraint, but I will make it explicit. – Make42 Feb 04 '22 at 22:18
