0

I'm a bit confused about the usefulness of reducing dimensions before doing a k-means clustering.

Suppose you want to apply k-means to a set points $(x_i)$ with high dimension. You want to minimize the cost function $\sum_i \|x_i-c_i\|^2$ where $c_i$ is the center of the cluster $x_i$ belongs to.

You have basically two methods:

  • A: do a k-means (Lloyd) directly on $(x_i)$
  • B: reduce the number of dimensions with some dimensional reduction method (such as SVD/PCA), and then apply k-means to the points with reduced dimensions

On the one hand, A is unlikely to find the global minimum, replications will help getting closer to it. It might have a high computational cost due to handling high dimension vectors and many replications.

On the other hand, B is more likely to get close to the global minimum (or even reach it) with fewer replications. But the minimum on the reduced version is not the original minimum (it is however known to be close to it). There is of course an additional computational cost for reducing dimensions.

All in all, is it known that B is better than A in either objective: minimizing the original cost function or computational cost?

Benoit Sanchez
  • 7,377
  • 21
  • 43
  • 1
    Weren't this topic already considered? Did you try a search `k-means dimensionality reduction`? – ttnphns Sep 22 '17 at 10:40
  • 1
    Many similar question were asked. This was the closest I found: https://stats.stackexchange.com/questions/29084/how-to-decide-if-to-do-dimensionality-reduction-before-clustering. This paper almost answers it, but not really: http://www.cameronmusco.com/personal_site/pdfs/masters.pdf – Benoit Sanchez Sep 22 '17 at 11:00
  • Here is a local selection to read first: https://stats.stackexchange.com/q/157621/3277; https://stats.stackexchange.com/q/300105/3277; https://stats.stackexchange.com/q/232500/3277; https://stats.stackexchange.com/q/12853/3277; https://stats.stackexchange.com/q/186184/3277; https://stats.stackexchange.com/q/93488/3277. – ttnphns Sep 22 '17 at 11:32

0 Answers0