I apologize in advance if my English isn't clear. Please feel free to leave a comment telling me which part doesn't make sense.

I'm currently working on a dataset of web data with around 7 variables. Most of them are categorical (product name, retailer, retail outlet, etc.), and one is quantitative: the price of the product.

I'm trying to apply a clustering algorithm (hierarchical clustering) to this data in R. First, I would like to run a multiple correspondence analysis to visualize the data, but I can't do that with the R package FactoMineR because there are too many levels (~8000, since the data covers products sold in an entire country), which causes memory problems.

I don't have experience working with real-life data, and the only solutions I've seen online are:

- creating dummy variables, which I don't think would be practical with this many levels;
- using combine.levels(), but I saw this solution in the context of a classification problem, and it doesn't make sense to me to use it in a clustering problem.
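For reference, the level-combining idea mentioned above can be sketched as follows. This is only a minimal illustration, written in plain Python rather than R so that it stays self-contained; the function name `lump_rare_levels`, the `min_share` threshold, and the toy `retailers` column are all hypothetical and not taken from any package. It simply relabels every level whose relative frequency falls below a threshold as a single "Other" level, which reduces the number of levels before any further analysis.

```python
from collections import Counter


def lump_rare_levels(values, min_share=0.05, other_label="Other"):
    """Relabel levels whose relative frequency is below min_share.

    Hypothetical helper for illustration; min_share=0.05 is an
    arbitrary assumption, not a recommended value.
    """
    counts = Counter(values)
    total = len(values)
    # Levels frequent enough to keep under their own name
    keep = {level for level, n in counts.items() if n / total >= min_share}
    return [v if v in keep else other_label for v in values]


# Toy column standing in for a high-cardinality variable such as retailer
retailers = ["A"] * 50 + ["B"] * 40 + ["C"] * 6 + ["D"] * 3 + ["E"] * 1
lumped = lump_rare_levels(retailers, min_share=0.05)
print(Counter(lumped))
```

In this toy column, levels D and E each account for less than 5% of the rows, so they are merged into "Other" while A, B, and C are kept. Whether such merging is meaningful for clustering (as opposed to classification) is exactly the open question here.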

Do you know of any solution to this kind of problem?

  • Correspondence analysis (CA) is a technique to show associations between categories of 2 or more categorical variables on a low-dimensional map. Even if it is possible to perform such an analysis when some or all of the variables have thousands of categories, plotting so many points on a perceptual map makes the whole idea useless: the map will be a mess. Instead, you should do your clustering first. Get a few clusters, label them, and then, if you have some aggregated data for them, draw a map of the clusters (not necessarily with CA). – ttnphns Feb 14 '16 at 16:29
  • Thanks for replying. Wouldn't it be the same issue, but with a smaller dataset? I don't fully understand the last sentence (English problems). – N F N Feb 14 '16 at 16:41
  • Have a look at https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 and links in there. Maybe multiple correspondence analysis with very many levels could use some of those ideas? – kjetil b halvorsen May 17 '17 at 11:10

0 Answers