Context: I have a dataset of instances labeled with different classes, and all classes share the same set of features. My research question is to identify which classes are more similar to each other.

My initial thought was to compare these classes by estimating their pairwise similarity. By pairwise similarity, I mean a similarity matrix over all the classes considered, as below:

Similarity matrix for classes A, B, C, D:

   A    B    C    D
A  1.0, 0.3, 0.7, 0.8
B  0.3, 1.0, 0.2, 0.4
C  0.7, 0.2, 1.0, 0.9
D  0.8, 0.4, 0.9, 1.0

Example:

For simplicity, let's consider the iris dataset, where my goal is to find whether Iris setosa is more similar to Iris virginica or to Iris versicolor. I want to compute the similarity for each possible pair (a, b) for a, b in {setosa, virginica, versicolor}.

Assume that I have standardized all the features between 0 and 1 universally. Only after standardizing did I separate the labeled iris instances into 3 subsets (X_setosa, X_virginica, X_versicolor), according to their classes. Then I generated the principal components (PC_setosa, PC_virginica, and PC_versicolor), one set for each subset s, as below:

from sklearn.decomposition import PCA

pca_s = PCA(n_components=2)
pca_s.fit(X_s)                # fit on the class subset X_s
PC_s = pca_s.components_      # the 2 eigenvectors, shape (2, n_features)

My questions are:

  1. Does the idea of comparing the PCs (eigenvectors) as a proxy for class similarity make sense?
  2. How could I compare the PC structures using cosine similarity? After some googling, I don't know if it's better to compare the loadings or the eigenvectors (PCs).
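
For concreteness, here is a minimal sketch of the kind of comparison I have in mind (the helper name pc_cosine_matrix is mine, and I assume pca_setosa, pca_versicolor, etc. were fitted as above):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def pc_cosine_matrix(pca_a, pca_b):
    # components_ has shape (n_components, n_features); rows are eigenvectors.
    # Eigenvector signs are arbitrary, so the absolute cosine is what matters.
    return np.abs(cosine_similarity(pca_a.components_, pca_b.components_))

# sim[i, j] = |cosine| between PC i of one class and PC j of the other, e.g.:
# sim = pc_cosine_matrix(pca_setosa, pca_versicolor)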
  • (1) What exactly do you mean by "pairwise" similarity? What is being paired? (2) How do you standardize the features: separately in each dataset or universally? (3) Please supply some contextual information and a description of your objectives so that we can understand what your "similarity" is intended to measure and how it might be related to PCA. – whuber Oct 29 '21 at 15:28
  • @whuber, (1) I'm pairing classes within my dataset. (2) universally, before separating the classes. (3) I've modified the question with context. – revy Oct 29 '21 at 16:03
  • Thank you. Why isn't your question, "to find if i. setosa is more similar to i. virginica or to i. versicolor," simply answered by inspecting the row of the distance matrix corresponding to i. setosa? – whuber Oct 29 '21 at 16:37
  • @whuber Thank you for your suggestion. How could I compute the distance matrix for classes composed of many instances? – revy Oct 29 '21 at 16:55
  • I don't have any idea about that, because it isn't evident what you mean by the "distance" between two classes, especially when they might be represented by samples of different sizes. – whuber Oct 29 '21 at 16:56
  • @whuber I have three classes composed of many instances each. Each instance belongs to a single class. All the instances are represented using the same feature set. I want to generate a single representation for each class (e.g. a vector or a matrix) considering all the instances in that class. Then, I want to compute the pairwise similarity between the three classes. If you can't understand the problem, please ask specific questions. – revy Oct 29 '21 at 22:56
  • Please check, I've edited a bit the title and the tags. – ttnphns Oct 31 '21 at 12:14
  • Your idea about using PCA is still unclear to me because you are showing only code and not the results themselves. Please add the details so that someone is able to reproduce and evaluate it. – ttnphns Oct 31 '21 at 12:23
  • I *have* asked for specifics, as you can see from the previous comments. The difficulties I have with this question--which still have not been resolved--arise not from lack of understanding, but from an understanding that suggests *myriad* solutions. It would take a large textbook just to describe the different senses in which one might want to compare classes and how one might quantify "similarities" among them. The question thereby invites many different (but plausibly valid) answers. That doesn't work here: please review our [help] concerning that issue. – whuber Oct 31 '21 at 18:28

1 Answer


So far I cannot evaluate your idea of using PCA here: it is unclear to me because you show only code and not the results themselves, and you do not explain the idea. Hence, I am answering your problem without considering your recipe.

Provided you have feature data for every point (instance), you can select a distance (or similarity) measure you like and compute the point-by-point distance matrix. Then, because you need the class-by-class distance matrix, you have to select a rule for calculating set distances from the point distances, that is, for computing a distance between every two groups of points given the distances between individual points. These rules largely coincide with the linkage methods used in hierarchical clustering.
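
If you work in Python rather than SPSS, here is a minimal sketch of this two-step recipe (the helper name class_distances is mine; I use the between-group average linkage rule, and scipy's cdist computes the point distances):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

def class_distances(subsets, metric="euclidean"):
    # Between-group average linkage: the class distance is the mean of
    # all cross-pair point distances between the two sets.
    k = len(subsets)
    D = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            D[i, j] = D[j, i] = cdist(subsets[i], subsets[j], metric=metric).mean()
    return D

# Example on iris, z-standardizing the features first:
X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
subsets = [X[y == c] for c in np.unique(y)]
print(class_distances(subsets))   # 3 x 3 species-by-species distances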

There are two ways to technically get the k x k class distances (for prespecified classes) from the n x n point distances according to the selected rule:

  1. Compute it manually, write a program that does it, or find a program that does it. My SPSS macro !KO_ASSCLU does exactly that job. (So you are welcome to it if you use SPSS.)
  2. Use a hierarchical cluster analysis program that has two options: a) a constraining option to first cluster points within the specified groups, before going on to cluster among the groups; b) an option to stop the agglomeration and output the distance matrix as it stands at the moment of the stop. My macro !KO_HIECLU does this.

(To get the macros, or to read what they do and how, go to my web page, linked in my profile, and download the "Clustering" collection.)
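
The linkage rule is interchangeable. For example, the Hausdorff set distance used in the second result below could be computed with a hypothetical helper like this (again a sketch of mine, not the macro itself):

from scipy.spatial.distance import cdist

def hausdorff(a, b, metric="euclidean"):
    # Symmetric Hausdorff distance between two point sets: the larger of
    # the two directed distances max_a min_b d(a, b) and max_b min_a d(a, b).
    d = cdist(a, b, metric=metric)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Plug it in by replacing the .mean() rule in class_distances above.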

Results for Iris dataset, using !KO_ASSCLU:

Euclidean distances between all the data points, fragment of the matrix (features were z-standardized before computing distances):

[image: fragment of the point-by-point distance matrix]

Distances between the three species (Between-group average linkage rule used):

[image: species-by-species distance matrix, average linkage]

Distances between the three species (Hausdorff distance linkage rule used):

[image: species-by-species distance matrix, Hausdorff linkage]
