Suppose you've got two high-dimensional datasets with no cases in common and most, but not all, variables in common. To answer your primary research question, you outer-join them, cluster the combined data, and obtain a list of variables that best distinguish each cluster from the rest. For example, suppose the observations are biological cells and the variables are gene expression levels; you've identified subpopulations of cells and found marker genes for them. Your two datasets are biological replicates -- cells from two different sources.
You now want to claim that the two datasets give similar clusterings (or not) based on the data. You can cluster each of them separately, but what's a reasonable descriptive statistic or formal test to use in this scenario? The obstacle is that, unlike in "Looking for a metric to compare clustering solutions to a reference clustering for a large dataset", our two clusterings are on separate datasets.
Possibilities:
- If you're using a probabilistic clustering method, then use a likelihood ratio test or an information criterion to decide whether separate models for each dataset are warranted. Unfortunately, I'm not using one.
- Duality: the cases are not in common, but the variables (mostly) are. Pretend the sets of cluster markers are your clusterings and use tactics like these. This makes me nervous; can someone either justify it or explain why it's unreliable?
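
To make the duality idea concrete, here is a minimal sketch of one way it could be computed. It assumes each dataset has been clustered separately and that you already have a marker list per cluster. The Jaccard index is my own choice of overlap measure for illustration, and every name here (`marker_overlap`, `markers_a`, `markers_b`, `shared_vars`, the gene labels) is hypothetical toy input, not real data. Markers are restricted to the variables the two datasets share, so a variable measured in only one dataset can't count against a match.

```python
def jaccard(a, b):
    """Jaccard index of two collections; defined as 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def marker_overlap(markers_a, markers_b, shared_vars):
    """Jaccard index for every (cluster in A, cluster in B) pair.

    Marker sets are first restricted to the variables present in both
    datasets (an assumption of this sketch, not a standard requirement).
    """
    shared = set(shared_vars)
    return {
        (ca, cb): jaccard(set(genes_a) & shared, set(genes_b) & shared)
        for ca, genes_a in markers_a.items()
        for cb, genes_b in markers_b.items()
    }


# Hypothetical toy input: marker lists per cluster, one dict per dataset.
markers_a = {"A0": ["g1", "g2", "g3"], "A1": ["g4", "g5", "only_in_a"]}
markers_b = {"B0": ["g4", "g5", "g6"], "B1": ["g1", "g2", "only_in_b"]}
shared_vars = ["g1", "g2", "g3", "g4", "g5", "g6"]

overlap = marker_overlap(markers_a, markers_b, shared_vars)

# For each cluster in A, the cluster in B whose markers overlap most.
best_match = {
    ca: max(markers_b, key=lambda cb: overlap[(ca, cb)]) for ca in markers_a
}

for pair, score in sorted(overlap.items()):
    print(pair, round(score, 3))
print(best_match)
```

This only yields a descriptive overlap table, not a test. It also inherits whatever instability the marker selection has (list length, cutoff, ties), which is part of why the approach makes me nervous.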