
I am new to machine learning. I am reading the papers "K-means Clustering via Principal Component Analysis" and "PCA-guided search for K-means", but they contain too many mathematical proofs for me to follow easily. Can anyone explain the idea in simple words?

Also, I am trying to experiment with this approach in Python, but my result is far from what the papers report.

My code is below. I use the AT&T Face Dataset (the ORL faces).
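X is just the matrix of flattened face images (40 people, 10 images each). I have left out my own loading code; something like the following should give an equivalent X, since the same images ship with scikit-learn as the Olivetti faces:

from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Olivetti faces = the AT&T/ORL face images:
# 400 samples (40 people x 10 images), each flattened to 64*64 = 4096 pixels
X = fetch_olivetti_faces().data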

TRAIN_PEOPLE = 40   # one cluster per person
MAX_ITER = 1000

print("method       Sum of squared distances")

# Run plain k-means in the PCA-reduced space
pca = PCA(n_components=TRAIN_PEOPLE).fit(X)
pca_result = KMeans(n_clusters=TRAIN_PEOPLE, max_iter=MAX_ITER).fit(pca.transform(X))

# Project the PCA-space centroids back onto the original pixel axes
# and use them as the initial centroids for k-means on the full data
kmeans = KMeans(init=pca_result.cluster_centers_.dot(pca.components_),
                n_clusters=TRAIN_PEOPLE, n_init=1, max_iter=MAX_ITER).fit(X)
print("pca-guided   " + str(kmeans.inertia_))

# Baseline: k-means++ initialization
kmeans = KMeans(init='k-means++', n_clusters=TRAIN_PEOPLE, n_init=1, max_iter=MAX_ITER).fit(X)
print("k-means++    " + str(kmeans.inertia_))

# Baseline: random initialization
kmeans = KMeans(init='random', n_clusters=TRAIN_PEOPLE, n_init=1, max_iter=MAX_ITER).fit(X)
print("random       " + str(kmeans.inertia_))

Here is the result:

method       Sum of squared distances
pca-guided   214982166.562   // too high!
k-means++    161294842.543
random       170072750.47

Could anyone explain what's going on here?
