
Suppose you've got two high-dimensional datasets with no cases in common and most, but not all, variables in common. To answer your primary research question, you outer-join them, cluster them, and obtain a list of variables that best distinguish each cluster from the rest. For example, suppose the observations are biological cells and variables are gene expression levels; you've identified subpopulations of cells and found marker genes for them. Your two datasets are biological replicates -- cells from two different sources.

You now want to claim that the two datasets give similar clusterings (or not) based on the data. You can cluster each of them separately, but what's a reasonable descriptive statistic or formal test to use in this scenario? The obstacle is that, unlike in "Looking for a metric to compare clustering solutions to a reference clustering for a large dataset", our two clusterings are on separate datasets.

Possibilities:

  • If you're using a probabilistic clustering method, then use a likelihood ratio test or an information criterion to decide whether separate models for each dataset are warranted. Unfortunately, I'm not using one.
  • Duality: the cases are not in common, but the variables (mostly) are. Pretend the sets of cluster markers are your clusterings and use tactics like these (see the sketch after this list). This makes me nervous; can someone either justify it or explain why it's unreliable?
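
One way to make the duality idea concrete is to compare the two clusterings through their marker-gene sets, restricted to the shared variables. Below is a minimal sketch assuming each clustering has already produced a list of marker sets; `markers_1` and `markers_2` are hypothetical placeholders, and best-match Jaccard averaging is just one of several possible matching schemes, not a method endorsed by the question.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of marker genes."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def best_match_similarity(markers_1, markers_2):
    """For each cluster's marker set from dataset 1, find the most similar
    marker set from dataset 2, then average those best-match scores."""
    scores = [max(jaccard(m1, m2) for m2 in markers_2) for m1 in markers_1]
    return sum(scores) / len(scores)

# Toy marker lists (hypothetical gene names, restricted to shared variables).
markers_1 = [{"GeneA", "GeneB", "GeneC"}, {"GeneD", "GeneE"}]
markers_2 = [{"GeneA", "GeneB"}, {"GeneD", "GeneF"}]
print(best_match_similarity(markers_1, markers_2))
```

Note that this only yields a descriptive statistic over the shared variables; it does not by itself justify treating marker sets as if they were case-level clusterings, which is the part the question asks someone to defend or refute.
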
eric_kernfeld
    If you get a new case for data-set 1, would you have a method of assigning it to an existing cluster? Or would you have to re-run the whole clustering algorithm? (Probably not, I think.) So you can consider the combination of clusters + "new data cluster assignment function" as a *classifier* for data of "type = data 1". If the variables for the 2 data sets are in common, then you could run each through the other's classifier, and then use "right vs. wrong" stats to evaluate (e.g. one of [these](https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers)). – GeoMatt22 Sep 29 '16 at 01:05
  • +1, But can you say more about the Duality point? Give a short summary of the linked paper. – ttnphns Sep 29 '16 at 02:49
  • What do you mean when you say they have variables in common? You mean features that describe the elements in each set? – roundsquare Sep 29 '16 at 05:59
  • Yes, features that describe the elements in each set. – eric_kernfeld Sep 29 '16 at 12:19
  • @ttnphns, I called it "duality" because when I have encountered "dual" methods in machine learning, they sometimes take the perspective of pretending that variables are cases and cases are variables. The paper I linked to was probably not the best choice because it deals with cluster stability/robustness for a single dataset, rather than cluster comparisons. The link to the CV question "Looking for a metric..." is probably more useful. – eric_kernfeld Sep 29 '16 at 12:25
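
GeoMatt22's comment above suggests a cross-classification check: treat each dataset's clustering plus its "new case assignment function" as a classifier, run the other dataset's cases through it, and score the agreement. A minimal sketch under assumed conditions follows: it uses k-means (where assignment of a new case is just `predict`), hypothetical matrices `X1` and `X2` with the same shared-variable columns in the same order, and an arbitrary cluster count `k`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X1 = rng.normal(size=(200, 50))   # placeholder for dataset 1 (cells x shared genes)
X2 = rng.normal(size=(180, 50))   # placeholder for dataset 2, same columns
k = 4                             # hypothetical number of clusters

km1 = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X1)
km2 = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X2)

# Run each dataset's cases through the other's cluster-assignment function and
# compare native labels with cross-assigned labels.
ari_2_on_1 = adjusted_rand_score(km2.labels_, km1.predict(X2))
ari_1_on_2 = adjusted_rand_score(km1.labels_, km2.predict(X1))
print(ari_2_on_1, ari_1_on_2)
```

The point of this construction is that each agreement score compares two partitions of the same cases (e.g. dataset 2's cells under its own model and under dataset 1's model), which sidesteps the fact that the two datasets share no cases.
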

0 Answers