
I am new to machine learning. I am reading the papers "K-means Clustering via Principal Component Analysis" and "PCA-guided search for K-means", but they contain too many mathematical proofs for me to follow easily. Can anyone explain the idea in simple words?

Also, I am trying to experiment with this approach in Python, but my result is far from what the papers report.

My code is below. I use the AT&T Face Dataset (the ORL faces).
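X is just the matrix of flattened face images (40 people, 10 images each). I have left out my own loading code; something like the following should give an equivalent X, since the same images ship with scikit-learn as the Olivetti faces:

from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Olivetti faces = the AT&T/ORL face images:
# 400 samples (40 people x 10 images), each flattened to 64*64 = 4096 pixels
X = fetch_olivetti_faces().data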

TRAIN_PEOPLE = 40   # one cluster per person
MAX_ITER = 1000

print("method       Sum of squared distances")

# Run plain k-means in the PCA-reduced space
pca = PCA(n_components=TRAIN_PEOPLE).fit(X)
pca_result = KMeans(n_clusters=TRAIN_PEOPLE, max_iter=MAX_ITER).fit(pca.transform(X))

# Project the PCA-space centroids back onto the original pixel axes
# and use them as the initial centroids for k-means on the full data
kmeans = KMeans(init=pca_result.cluster_centers_.dot(pca.components_),
                n_clusters=TRAIN_PEOPLE, n_init=1, max_iter=MAX_ITER).fit(X)
print("pca-guided   " + str(kmeans.inertia_))

# Baseline: k-means++ initialization
kmeans = KMeans(init='k-means++', n_clusters=TRAIN_PEOPLE, n_init=1, max_iter=MAX_ITER).fit(X)
print("k-means++    " + str(kmeans.inertia_))

# Baseline: random initialization
kmeans = KMeans(init='random', n_clusters=TRAIN_PEOPLE, n_init=1, max_iter=MAX_ITER).fit(X)
print("random       " + str(kmeans.inertia_))

Here is the result:

method       Sum of squared distances
pca-guided   214982166.562   // too high!
k-means++    161294842.543
random       170072750.47

Could anyone explain what's going on here?
