Applying Multiple Correspondence Analysis when predictors have thousands of levels

Question

I apologize in advance if my English isn't too clear. Please feel free to leave a comment and tell me which part doesn't make sense.

I'm currently working on a dataset of web data with around 7 variables. Most of them are categorical (product name, retailer, retail outlet, etc.), and one is a quantitative variable that represents the price of the product.

I'm trying to apply a clustering algorithm (hierarchical clustering) to this data in R. First I would like to run a multiple correspondence analysis to visualize the data, but I'm not able to do that with the R package FactoMineR because there are too many levels (~8000, because this is data for products sold in an entire country), and this causes memory problems.

I don't have experience working with real-life data, and the only solutions I've seen online consist of either:
- creating dummy variables. I don't think that would be very practical with this many levels.
- using combine.levels(), but I saw this solution in the context of a classification problem, and it doesn't make sense to me to use it in a clustering problem.

Do you know of any solution to this kind of problem?

Correspondence analysis (CA) is a technique to show associations between the categories of 2 or more categorical variables on a low-dimensional map. Even if it were possible to perform such an analysis when some or all of the variables have thousands of categories, plotting so many points on a perceptual map makes the whole idea useless: the map will be a mess. Instead, you should first do your clustering. Get a few clusters, label them, then, if you have some aggregated data for them, do a map for them (not necessarily CA). — ttnphns, Feb 14 '16 at 16:29
Thanks for replying. Wouldn't it be the same issue, but with a smaller dataset? I don't fully understand the last sentence (English problems). — N F N, Feb 14 '16 at 16:41
Have a look at https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 and links in there. Maybe multiple correspondence analysis with very many levels could use some of those ideas? — kjetil b halvorsen, May 17 '17 at 11:10
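
The level-combining idea raised in the question can be sketched in a few lines. This is only an illustration, written in Python rather than R, and it assumes that merging every level below some frequency share into one catch-all level is acceptable for the analysis, which is a judgment call in a clustering setting. The function name `lump_rare_levels`, the `min_share` threshold, the `"OTHER"` label, and the toy `retailer` column are all hypothetical and not taken from the thread.

```python
from collections import Counter


def lump_rare_levels(values, min_share=0.01, other_label="OTHER"):
    """Replace levels whose relative frequency is below min_share
    with a single catch-all label (hypothetical helper)."""
    counts = Counter(values)
    total = len(values)
    # Keep only the levels that are frequent enough.
    keep = {level for level, n in counts.items() if n / total >= min_share}
    return [v if v in keep else other_label for v in values]


# Hypothetical toy column: two common retailers and two rare ones.
retailer = ["A"] * 50 + ["B"] * 45 + ["C"] * 3 + ["D"] * 2
lumped = lump_rare_levels(retailer, min_share=0.05)
print(sorted(set(lumped)))
```

Applied to each high-cardinality variable before the MCA, this shrinks the number of indicator columns the analysis has to build. The cost is that all rare levels become indistinguishable, so the threshold should be chosen with the later clustering in mind.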
