1

I segmented products into 10 clusters using k-means clustering on historical data (dispatch data). For new products, I can use some dimension- and feature-based data (e.g. product size, color, budget, etc.). My goal is to match new products to the already existing clusters.

Is this possible?


Thank you for your answers.

Actually I used k-medoids for clustering my data.

As a second step, following your suggestions, using these clusters as the target for classification makes sense to me.

However, I have run into new problems when using classifiers:

  • Recursive Partitioning and Regression Trees: only used one feature. Message: Variables actually used in tree construction: [1] ProductCode
  • Random Forest: Error message: Can not handle categorical predictors with more than 53 categories.
  • KNN: Data must be scaled before usage, but my data has lots of categorical features, so KNN is not suitable for my problem.
  • SVM: I think this classifier wants only 2 features. Error message: contrasts can be applied only to factors with 2 or more levels

So, what classifier should I use?

ayca_bayraktar
  • Asking for programming advice is off-topic here; perhaps R-help or StackOverflow would be better, although a minimal reproducible example would help. If you think you also have a statistical question then please edit your question to clarify it. – mdewey Feb 27 '17 at 14:37
  • yes, it is possible with `predict` – Drey Feb 27 '17 at 14:43
  • Can you give some details on the usage of predict? There are lots of sub-types for it in the help doc. – ayca_bayraktar Feb 27 '17 at 16:26
  • Why can't you use the distance of the new product to the cluster center of the identified clusters to assign it to a group? – discipulus Feb 28 '17 at 05:36
  • See http://stats.stackexchange.com/q/245902/35989 – Tim Mar 03 '17 at 17:06
  • SVM: "I think this classifier wants only 2 features. Error Mesage: contrasts can be applied only to factors with 2 or more levels" SVM is a binary classifier and hence you get this. The classifier needs 2 classes of your target variable and NOT 2 features. Read more [here](https://stackoverflow.com/questions/44200195/how-to-debug-contrasts-can-be-applied-only-to-factors-with-2-or-more-levels-er) KNN: "The data has a lot of categorical features". Did you use categorical variables in k-means clustering? Seems incorrect. Read more [here](https://datascience.stackexchange.com/questions/22/k-means- –  Nov 20 '20 at 17:57

3 Answers

4

Clustering is done on unlabelled data and returns a label for each data point. Classification requires labels.

Therefore, you first cluster your data and save the resulting cluster labels. Then you train a classifier using these labels as the target variable. By saving the labels you effectively separate the steps of clustering and classification.

This enables you to use any classification algorithm (Random Forest, SVM, Naive Bayes, ....).

The problematic part of this pipeline is the lack of robustness of the k-means algorithm. Therefore you will have to evaluate the clustering result and possibly perform k-means repeatedly. Alternatively, you could use other clustering algorithms and compare results.
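
A minimal sketch of this cluster-then-classify pipeline in R, using the built-in iris measurements as stand-in data and the randomForest package (any classifier would do; names such as hist_data and new_products are just placeholders):

library(randomForest)

# Step 1: cluster the historical data (numeric features only here)
hist_data <- iris[, 1:4]
km <- kmeans(hist_data, centers = 3, nstart = 25)

# Step 2: save the cluster labels and use them as the classification target
hist_data$cluster <- factor(km$cluster)

# Step 3: train any classifier on the saved labels
rf <- randomForest(cluster ~ ., data = hist_data)

# Step 4: assign new products to a cluster via the trained classifier
new_products <- iris[1:5, 1:4]   # stand-in for new product features
predict(rf, newdata = new_products)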

Nikolas Rieble
  • If you have cases where the clustering label is uncertain (that is, two or more labels competed as being almost as good) then you can save that also; it could be useful in judging the results of your discriminant analysis. Maybe some methods could use that information also! – kjetil b halvorsen Feb 28 '17 at 13:17
  • @kjetilbhalvorsen could you give an example of such a method? – Nikolas Rieble Feb 28 '17 at 13:20
  • Not certain, maybe there exist some "fuzzy discriminant analysis"? – kjetil b halvorsen Feb 28 '17 at 13:20
  • As an afterthought, if you have a discriminant analysis function allowing a "weights" argument, and you have, say, an observation which could equally be assigned to two different clusters, simply duplicate the line for that observation and give both weight $w=0.5$. – kjetil b halvorsen Feb 28 '17 at 13:50
1

Although it is feasible to apply classification (including training, cross-validation and benchmarking) using cluster labels obtained by clustering the same data, it should be clear that this will be "successful" even if you have RANDOM high-dimensional data. Benchmarking on the same data that you used for clustering leads to over-optimistic classifier performance estimates.

Therefore, data science purists do not like it.

Similarly, you get over-optimistic p-values when you test for differences between clusters of some high-dimensional data. So be careful about using statistical testing for feature selection prior to classifier building: such tests can only be regarded as heuristic procedures to prioritize features, and the p-values must not be interpreted in the usual way.
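
A minimal sketch in R to illustrate the point, clustering purely random data and then "classifying" it (the choice of rpart and the fold construction are just for illustration):

set.seed(1)

# purely random high-dimensional data: there is no real cluster structure
x <- as.data.frame(matrix(rnorm(200 * 20), nrow = 200))

# cluster it anyway and use the labels as a classification target
x$cluster <- factor(kmeans(x[, 1:20], centers = 3, nstart = 25)$cluster)

# cross-validated accuracy of a classifier trained on these labels
library(rpart)
folds <- sample(rep(1:5, length.out = nrow(x)))
acc <- sapply(1:5, function(k) {
  fit <- rpart(cluster ~ ., data = x[folds != k, ])
  mean(predict(fit, x[folds == k, ], type = "class") == x$cluster[folds == k])
})
mean(acc)  # typically well above the 1/3 expected for arbitrary labels, even though the data is pure noise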

Eike Staub
0

Please be mindful of the other comments about whether you have labelled data.

So you have clustered your products and found the centroids. The straightforward way is to (preprocess your test data and) compute the distance between your test data and the centroids, and then assign the label of the closest centroid. Do it yourself, or let other methods do it for you:

# two training clusters drawn from Gaussians around (0, 0) and (1, 1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
# new observations from the same two clusters
xtest <- rbind(matrix(rnorm(10, sd = 0.3), ncol = 2),
               matrix(rnorm(10, mean = 1, sd = 0.3), ncol = 2))

# k-centroids clustering (requires the flexclust package)
clust2 <- flexclust::kcca(x, k = 2)
# assign each new observation to the nearest learned centroid
flexclust::clusters(clust2, xtest)

But a little bit of critique: be aware of how k-means works. Did you standardize your data? (Otherwise you will have unequal weights in your distance computation.) How do you handle categorical data? Etc.
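
For the standardization point, a minimal sketch (reusing x and xtest from the snippet above; the attribute names are those returned by R's scale):

# standardize the training features, then apply the SAME scaling to new data
x_scaled <- scale(x)
xtest_scaled <- scale(xtest,
                      center = attr(x_scaled, "scaled:center"),
                      scale  = attr(x_scaled, "scaled:scale"))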

Besides that, there are several other approaches to cluster analysis.

Drey