
I am interested in whether my dataset of questionnaire responses has general patterns. Because the dataset has many variables (questionnaire items), I plan to reduce their number by principal component analysis (PCA) and then perform cluster analysis, such as k-means or hierarchical clustering, on the principal components to improve cluster quality.

I will evaluate cluster characteristics by the scores of the original survey questions, not by the principal components.

In this case, should clustering tendency tests, such as the Hopkins statistic, the Silverman test, or the dip test, be run on the original data without PCA or on the principal component scores? (For clustering tendency tests, please see https://arxiv.org/pdf/1808.08317.pdf or https://en.wikipedia.org/wiki/Hopkins_statistic)

If the data before PCA has a clustering tendency, it seems justified to evaluate the original questionnaire scores on a cluster-by-cluster basis. On the other hand, if the principal components show no clustering tendency, it may not be justified to apply cluster analysis to them.
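A minimal sketch of the comparison being asked about, on synthetic data rather than real questionnaire responses. It computes the Hopkins statistic (in the form given on the linked Wikipedia page, with distances raised to the power of the dimension) once on the raw variables and once on the first two principal component scores. The function names `hopkins` and `pca_scores`, the sample size `m`, the choice of two components, and the two-blob test data are all assumptions made for illustration. Values near 0.5 suggest no clustering tendency; values closer to 1 suggest clustered data.

```python
import numpy as np


def hopkins(X, m=None, seed=0):
    """Hopkins statistic: about 0.5 for uniform data, closer to 1 if clustered."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = m or max(1, n // 10)  # assumption: sample roughly 10% of the rows

    # m real observations, sampled without replacement
    idx = rng.choice(n, size=m, replace=False)
    # m artificial points drawn uniformly from the bounding box of X
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))

    def nn_dist(P, exclude=None):
        # Euclidean distance from each row of P to its nearest row of X
        D = np.linalg.norm(P[:, None, :] - X[None, :, :], axis=2)
        if exclude is not None:
            # a sampled observation must not count itself as its neighbour
            D[np.arange(len(P)), exclude] = np.inf
        return D.min(axis=1)

    u = nn_dist(U)                    # artificial point -> nearest observation
    w = nn_dist(X[idx], exclude=idx)  # observation -> nearest other observation
    return (u ** d).sum() / ((u ** d).sum() + (w ** d).sum())


def pca_scores(X, k):
    """Scores on the first k principal components (PCA via SVD, centred data)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


# Synthetic stand-in for questionnaire data: two blobs in 10 dimensions
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 10)),
    rng.normal(8.0, 1.0, size=(100, 10)),
])

h_raw = hopkins(X)                # tendency in the original variables
h_pc = hopkins(pca_scores(X, 2))  # tendency in the data actually clustered
print(f"Hopkins (raw): {h_raw:.3f}, Hopkins (2 PCs): {h_pc:.3f}")
```

With real questionnaire items, tied responses can produce nearest-neighbour distances of zero, and the uniform bounding-box reference may be a poor null for discrete scales, so the two values should be read as a rough comparison rather than a formal test.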

Tom M.
  • Can you say more in your question about the clustering tendency tests? Or leave a link? – ttnphns Jul 01 '21 at 12:37
  • @ttnphns: Thank you. I edited my question and added the links. I think I should perform clustering tendency tests after performing PCA. The problem is that, in the social sciences, many datasets may not pass the tests. – Tom M. Jul 01 '21 at 22:48
  • Thanks for the link! I haven't read the article, but my preliminary expectation is that the immediate logical answer to your question is "test on the PCs", because it is the PCs that are the data directly undergoing cluster analysis. – ttnphns Jul 02 '21 at 08:29
  • @ttnphns: Thank you very much. That's persuasive. I think that the loss of clusterability when PCA is applied to raw data, which is clusterable, is common in social sciences, where data is often not well separated. I think this can be a problem if clustering on principal components is essential for improving cluster quality. This field still looks underdeveloped. – Tom M. Jul 02 '21 at 09:04
  • I would also recommend trying internal clustering criteria ([1](https://stats.stackexchange.com/a/358937/3277), [2](https://stats.stackexchange.com/q/195456/3277)). This field is traditional and more developed than the tests. Most of the criteria answer the question "how many clusters (starting from 2 or more) are there". Some, like the Gap statistic, also address the question of whether there are clusters at all. – ttnphns Jul 02 '21 at 09:58
  • @ttnphns: Thank you. My understanding is that I should use the internal indices to determine the optimal number of clusters only after passing the clustering tendency test (clusterability test). If I apply cluster analysis when the null hypothesis of the clustering tendency test cannot be rejected, clustering methods such as k-means will partition the data as if there were clusters even when none exist. In fact, in my case, the internal indices reported an optimal number of clusters, on the assumption that the data can be partitioned. – Tom M. Jul 02 '21 at 11:01
