User segmentation by clustering with sparse data

Question

Imagine that I have 100k users and 1k categories. For each user, I know how much money they have spent in up to 5 of those categories, so my data is very sparse.

Now I want to group users by the money they spend on different categories. This way, I could group together users who are 'cheap' in certain categories and 'snobby' in others.

I standardized the values by expressing each one as the number of standard deviations it lies from its category mean, and then tried k-means clustering. However, as the number of k-means iterations increases, one cluster keeps growing while the others shrink until they contain only a few users.

How can I tackle the problem of clustering with sparse data? Any pointers, suggestions or ideas are appreciated.

You may first go for a dimension reduction technique like PCA, so that you can group the 1000 category columns into a few components. Then try clustering with the PCs you've chosen. — Vikram Venkat, Mar 02 '16 at 11:50
@Vikramnath Venkatasubramani : Conceptually, it looks close to Canonical Discriminant Analysis. May I suggest that you post an answer and add some details on how to do it with a huge dataset? — YCR, Mar 02 '16 at 12:35
Each user has information for only 5 out of the 1k categories? Can you say more about the kind of information that you've got? — Mike Hunter, Mar 02 '16 at 12:39
@DJohnson I only have the transactions of users. So basically, the average and standard deviation of purchase values per user for up to 5 categories out of 1000. — bfaskiplar, Mar 02 '16 at 13:02
What are the categories? How do your users end up in as many as 5? Why not more? — Mike Hunter, Mar 02 '16 at 13:23
@DJohnson, let's say it is 10, not 5. It is essentially still a sparsity problem. — bfaskiplar, Mar 02 '16 at 13:46

score 11 · Accepted Answer · edited Apr 13 '17 at 12:44

$K$-Means is very unlikely to give meaningful clusters in such a high-dimensional space (see e.g. the Curse of Dimensionality).

I agree with the suggestions in the comments: you need to reduce the dimensionality of your data and then do $K$-Means on the reduced space.

However, I would not do PCA in the proper way: PCA requires mean normalization (centering each column), and that will turn a sparse matrix into a dense one. What you can do instead is SVD, without mean normalization, and then apply the clustering algorithm. Also note that Randomized SVD should work fine, and much faster.
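The SVD-then-cluster idea above can be sketched as follows. This is only a minimal illustration, assuming scikit-learn and SciPy are available: the synthetic spend matrix, its sizes, the number of components (20) and the number of clusters (8) are all made-up placeholders, not values from the question.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real data: each user has spend values
# in at most 5 categories (sizes are scaled down for illustration).
n_users, n_categories, per_user = 2000, 200, 5
rows = np.repeat(np.arange(n_users), per_user)
cols = rng.integers(0, n_categories, size=n_users * per_user)
vals = rng.gamma(shape=2.0, scale=20.0, size=n_users * per_user)
# Duplicate (row, col) pairs are summed when the matrix is built.
X = sparse.csr_matrix((vals, (rows, cols)), shape=(n_users, n_categories))

# TruncatedSVD does not center the columns, so X is never densified.
# algorithm="randomized" selects the randomized SVD solver.
svd = TruncatedSVD(n_components=20, algorithm="randomized", random_state=0)
Z = svd.fit_transform(X)  # dense array of shape (n_users, 20)

# Cluster in the reduced space; 8 clusters is an arbitrary choice.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
labels = kmeans.fit_predict(Z)

# Cluster sizes are worth inspecting, given the "one giant cluster"
# symptom described in the question.
cluster_sizes = np.bincount(labels, minlength=8)
print(cluster_sizes)
```

In practice the number of components and the number of clusters would have to be chosen for the actual data, for example by looking at `svd.explained_variance_ratio_` and at how balanced `cluster_sizes` turns out.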

Another potentially interesting technique that you can apply is Non-Negative Matrix Factorization. Since your data contains only positive values (if I got it correctly), NMF should suit the problem well. Also, you can interpret the results of NMF as a clustering: when we compute an NMF with $n$ components, we can think of the columns of the resulting matrix as clusters, with the value in column $i$ of an observation's row being the degree of association of that observation with cluster $i$.

You can read more about applying NMF for clustering in "Document clustering based on non-negative matrix factorization" by Wei Xu, Xin Liu, and Yihong Gong.
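The NMF-as-clustering reading described above can be sketched like this. It is a rough illustration, assuming scikit-learn and SciPy, and it uses raw (non-negative) spend amounts rather than the standardized values from the question, since z-scores can be negative. The synthetic matrix, its sizes and the choice of 8 components are hypothetical.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical non-negative user x category spend matrix.
n_users, n_categories, per_user = 2000, 200, 5
rows = np.repeat(np.arange(n_users), per_user)
cols = rng.integers(0, n_categories, size=n_users * per_user)
vals = rng.gamma(shape=2.0, scale=20.0, size=n_users * per_user)
X = sparse.csr_matrix((vals, (rows, cols)), shape=(n_users, n_categories))

# Factorize X ~ W @ H with non-negative W and H; NMF accepts sparse input.
n_components = 8  # arbitrary number of "clusters"
nmf = NMF(n_components=n_components, init="nndsvd", max_iter=300, random_state=0)
W = nmf.fit_transform(X)  # (n_users, n_components): user-to-cluster weights
H = nmf.components_       # (n_components, n_categories): category profile per cluster

# Hard assignment: each user goes to the component with the largest weight.
labels = W.argmax(axis=1)

# The categories that characterize a component can be read off the rows of H.
top_categories = np.argsort(H, axis=1)[:, ::-1][:, :5]
print(np.bincount(labels, minlength=n_components))
```

Taking the argmax gives a hard clustering; keeping the rows of `W` as they are gives the softer "degree of association" reading, where a user can belong to several components at once.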
