I am using HDBSCAN in Python to cluster my data. Next I want to use the resulting cluster labels to train a classifier (e.g., a random forest) on them. The ultimate objective is to assign a label to new, unseen data. My question is: would it be normal to obtain 100% train and/or test accuracy with such a model?
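For concreteness, here is a minimal sketch of the pipeline being asked about, assuming the `hdbscan` package and scikit-learn are available; the data and parameter values are placeholders, not the actual setup:

```python
import hdbscan
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # placeholder data

# Step 1: cluster the data; HDBSCAN labels noise points as -1.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10).fit(X)
labels = clusterer.labels_

# Step 2: train a classifier on the cluster labels (noise points dropped here).
mask = labels != -1
X_tr, X_te, y_tr, y_te = train_test_split(X[mask], labels[mask], random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Step 3: the train/test accuracies the question asks about.
print(clf.score(X_tr, y_tr), clf.score(X_te, y_te))
```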

- `if it would be normal to obtain 100% accuracy` It is unlikely in most instances, unless the classifier you use is isomorphic, with respect to its objective function, to the clusterer you used. DBSCAN, for example, does not have an optimization task equivalent to that of a classification tree. – ttnphns Sep 26 '16 at 11:28
- Please check whether I understood/edited your question correctly. If not, you can roll back the edit. – ttnphns Sep 26 '16 at 11:33
- This question seems clear enough to me. – gung - Reinstate Monica Sep 26 '16 at 11:45
- See this relevant question, http://stats.stackexchange.com/q/236964/3277, about how to assign new objects to old clusters using hierarchical cluster analysis linkage rules. That is an example of a classifier which does not establish a classifying rule from the data but uses a rule isomorphic to the one used during the (past) clustering. – ttnphns Sep 26 '16 at 14:50
3 Answers
Your whole approach seems problematic: why apply supervised learning to labels obtained from clustering? You are adding additional noise!
In most settings, if you have labeled data, you can build a classification model using supervised learning techniques. If you do not have labeled data, you can run clustering to discover patterns in the data. It is not common to train a model on labels obtained from clustering.
This is because:
1. We may not be sure that the clustering results are good enough. The algorithm has many parameters (say, the number of clusters, or the cutting threshold in hierarchical clustering), and verifying that the results are good is a separate task in itself.
2. If we have clustering results, we usually can "classify" future data into clusters based on those results. For example, if you use k-means, you can use the centroids for future classification, as the sketch below shows. I am not familiar with HDBSCAN, though.
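To illustrate the k-means case, here is a minimal sketch on placeholder data: a new point is assigned to the cluster whose centroid is nearest, which is what scikit-learn's `KMeans.predict` does for you.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # placeholder data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

x_new = np.array([[0.0, 1.0]])  # a new, unseen point

# Assign the new point to the cluster whose centroid is nearest ...
dists = np.linalg.norm(km.cluster_centers_ - x_new, axis=1)
print(int(np.argmin(dists)))

# ... which is exactly what KMeans.predict does.
print(km.predict(x_new)[0])
```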
Finally, to your question about 100% train or test accuracy: I would say it depends on your data and the (supervised) model you use. For example, if your data is "simple" (say, linearly separable) and your model captures it well, then it is possible to get 100% training and testing accuracy. But for most real-world data, it is unlikely you will get 100% on both.
In the real world, anything can happen in the test data, and you never know what you will see in production. So a training accuracy that is too high may not be a good thing, for overfitting reasons.

- +1, though `not sure the clustering results is good enough` is not a strong point against; what if we see they are good for us? On the point `we usually can "classify" future data into clusters based on the clustering results` - agreed. This sort of task is "assignment". It is classification which uses, for new objects, the same assignment rule, an isomorphic objective function (please see my comment under the question), as the clustering done before used for the old, "train" objects. Some clustering algorithms can be implemented to take new objects and assign them to the clusters already formed. – ttnphns Sep 26 '16 at 12:02
- This makes sense to me too, but how can I address the second point you mention if I used HDBSCAN, a method not based on centroids? Wouldn't training a simple classification model work, just to predict the new point's class label? Independently of whether we are sure of the clustering result, I am checking that with other methods. – VilionF Sep 26 '16 at 12:02
- @VilionF Interesting question and I do not know ... I am asking a more general form of this question now ... – Haitao Du Sep 26 '16 at 12:20
- @VilionF, then your further search might be a more technical one, around DBSCAN implementations and their options. You can post a separate question asking whether somebody knows if this clustering method can be made to assign new objects to clusters _already_ existing. – ttnphns Sep 26 '16 at 12:26
- @ttnphns Let me see if I understand. So if I train a classification model on the clustered data, the "assignment" or "classification" of a new point is only correct if the classifier is isomorphic with respect to the clustering method? Is this mandatory? – VilionF Sep 26 '16 at 12:30
- @VilionF, not mandatory, sure - but likely. The logic is obvious. Imagine the simplest example, K-means clustering: it assigns points to their nearest centroids (until the centroids stabilize). You got the clusters (and know their centroids). Now, if some "classifier" or "assigner" of yours assigns objects by exactly the same or a similar rule as K-means, then (to cont.) – ttnphns Sep 26 '16 at 12:48
- (cont.) ...you can re-assign your old points by it and will get those same clusters ("accurate results"); this justifies thinking that the classifier will be "accurate" for newly arriving objects too, unless those new objects are atypical relative to the old sample of points. But if your classifier assigns according to a different rule (say, to medoids, not centroids), then even typical new or old objects will often be classified "incorrectly" - simply because the rule differs from that of K-means. – ttnphns Sep 26 '16 at 12:49
There are 'prediction' approaches possible within the HDBSCAN* framework that make use of the condensed tree data structure. This is similar to assigning new points to the closest centroid in K-means. You can read more about it here.
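In the `hdbscan` Python library, this mechanism is exposed as `approximate_predict`, which is, to my understanding, what this answer refers to. A minimal sketch on placeholder data:

```python
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # placeholder data

# prediction_data=True caches the extra structures (including the
# condensed tree) needed to place new points into the existing clusters.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(X)

new_points, _ = make_blobs(n_samples=5, centers=3, random_state=1)

# Returns a cluster label (-1 for noise) and a membership strength for
# each new point, without re-running the clustering.
labels, strengths = hdbscan.approximate_predict(clusterer, new_points)
print(labels, strengths)
```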

- This should be the accepted answer - no additional algorithm is necessary; the clustering algorithm already provides a way to categorize new samples! – Thomas Jul 26 '18 at 14:29
Never aim for 100% accuracy. That indicates you are overfitting.
With labels from clustering, overfitting is really bad, because the labels were not flawless in the first place!
But also: how would you use the answer? There is no automatic question you could answer; clusters are something to study, not to put into production.

- I am working with human antibody structures. I constructed an N×M dataframe, where N are antibody structures and M are features extracted only from their structural information. Next I applied an exploratory analysis with HDBSCAN, trying to find good structural clusters based on the features I proposed. I did validate each cluster with other bioinformatics methods. The final task is, based on my approach and the constructed features, to find the closest human structure to a non-human antibody; in summary, input a non-human Ab structure and obtain the closest human one. Maybe kNN helps with the final step. – VilionF Sep 27 '16 at 00:11
- If you have validated the clusters this way (and preferably, you have even found some points incorrectly assigned), then you can proceed as with classification. kNN classification is a good candidate (because HDBSCAN worked). You shouldn't get 100% with that, though; in particular not on the difficult cases that were incorrectly assigned by HDBSCAN. – Has QUIT--Anony-Mousse Sep 27 '16 at 06:10
- But I don't see why you need the clusters for that objective. It's a similarity-search objective (see the sketch after these comments). – Has QUIT--Anony-Mousse Sep 27 '16 at 06:11
- Good point. Clustering the data is kind of an intermediate objective; I can characterize my data and present different groups of similar structures. Well, when I started to work on this my knowledge of machine learning was zero, haha. I am having fun with this too. Thank you for your help. – VilionF Sep 27 '16 at 16:46
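A minimal sketch of that similarity-search route, using scikit-learn's `NearestNeighbors` on hypothetical feature matrices: the closest human structure is retrieved directly, with no classifier in between.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder for the N x M matrix of human antibody structure features.
human_features = np.random.rand(200, 10)

# Index the human structures once.
nn = NearestNeighbors(n_neighbors=1).fit(human_features)

# Query with a non-human antibody's feature vector (also a placeholder)
# to get the closest human structure and its distance directly.
non_human = np.random.rand(1, 10)
dist, idx = nn.kneighbors(non_human)
print(idx[0, 0], dist[0, 0])
```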