How to interpret the clusplot in R

Question

I have plotted a bivariate cluster plot (of a partitioning object) using clusplot from the cluster package. The code is:

k.means.fit <- kmeans(pima_diabetes_kmean[, c(input$first_model, input$second_model)], 2)
output$kmeanPlot <- renderPlot({
  # K-Means
  clusplot(
    pima_diabetes_kmean[, c(input$first_model, input$second_model)],
    k.means.fit$cluster,
    main = '2D representation of the Cluster solution',
    color = TRUE,
    shade = TRUE,
    labels = 5,
    lines = 0
  )
})

The plot (attached below) shows Component 1 on the x-axis and Component 2 on the y-axis. Does Component 1 refer to Pregnancy and Component 2 to Glucose, as in a simple dot plot? I am confused about this.

Also, it says that the two components explain 100% of the point variability. What exactly does that mean?

Moreover, why are the green points in the cluster plot different from the red/black points in the dot plot, although both are plotting the same data? Following is the code for plotting points:

plot(
  pima_diabetes_kmean[, c(input$first_model, input$second_model)],
  col = alpha(k.means.fit$cluster, 0.2),
  pch = 20,
  cex = 3
)
points(
  k.means.fit$centers,
  pch = 4,
  cex = 4,
  lwd = 4,
  col = "blue"
)

Kozolovska · Accepted Answer · 2017-04-20T11:30:25.783

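clusplot uses PCA to draw the data: it plots the points along the first two principal components. So Component 1 and Component 2 are not your original variables; each one is a linear combination of them.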

You can read more about it in "Making sense of principal component analysis, eigenvectors & eigenvalues".

Principal components are the orthogonal axes along which the data has the most variability. If your data is 2D, two principal components explain all of its variability, which is why you see 100% explained. If your data is higher-dimensional but its variables are strongly correlated, a lower-dimensional space can still explain most of it.
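As a minimal sketch of the 100% figure: the data below is simulated and merely stands in for two selected columns (the names pregnancies and glucose and their distributions are assumptions, not the actual Pima data). It uses princomp from base R to show that two components computed from two columns account for all of the variance, and that each component mixes both columns.

```r
set.seed(1)

# Hypothetical stand-in for two selected columns (simulated, not the real data)
two_cols <- cbind(
  pregnancies = rnorm(100, mean = 4, sd = 3),
  glucose     = rnorm(100, mean = 120, sd = 30)
)

pca <- princomp(two_cols)

# Share of the total variance carried by each principal component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(var_share)

# With two columns and two components, the shares add up to the whole variance
print(sum(var_share))

# Loadings: each component is a weighted combination of both original columns
print(unclass(pca$loadings))
```

The loadings matrix is the place to look if you want to know how much each original variable contributes to Component 1 and Component 2.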

The difference you see between the graphs comes from plotting the data along the principal components instead of the original axes.

library(MASS)     # mvrnorm()
library(cluster)  # clusplot()

# Two 2D Gaussian clouds with different means
x <- mvrnorm(200, c(-2, 2), diag(2))
y <- mvrnorm(200, c(2, -2), diag(2))
tot <- rbind(x, y)

par(mfrow = c(2, 2))
## Original
plot(x, ylim = c(-5, 5), xlim = c(-5, 5), col = 'green', main = 'Original Data')
points(y, col = 'red')

## clusplot
kmean.fit <- kmeans(tot, 2)
clusplot(tot, kmean.fit$cluster, main = 'clusplot')

## PCA plot: project the data onto the principal components
pca.tot <- princomp(tot)
# Rows 1:200 are x and rows 201:400 are y; colors match the first panel
plot(tot[1:200, ] %*% pca.tot$loadings, ylim = c(-3, 3), xlim = c(-6, 6), col = 'green', main = 'PCA')
points(tot[201:400, ] %*% pca.tot$loadings, col = 'red')

You can play around with this: add dimensions and change the correlation structure to see how the results change.
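One way to try that, as a rough sketch: the covariance matrix Sigma below is an arbitrary assumption chosen only for illustration, and the correlated data is generated with base R through a Cholesky factor rather than mvrnorm. With four dimensions, the first two components no longer explain everything, and how much they explain depends on the correlations you pick.

```r
set.seed(42)
n <- 300

# Assumed correlation structure: variables 1 and 2 strongly correlated,
# variables 3 and 4 only weakly related to the rest
Sigma <- matrix(c(1.0, 0.9, 0.2, 0.1,
                  0.9, 1.0, 0.2, 0.1,
                  0.2, 0.2, 1.0, 0.3,
                  0.1, 0.1, 0.3, 1.0), nrow = 4, byrow = TRUE)

# Draw correlated normal data: independent normals times the Cholesky factor
dat <- matrix(rnorm(n * 4), ncol = 4) %*% chol(Sigma)

pca4 <- princomp(dat)

# Cumulative share of variance explained by the first 1, 2, 3, 4 components
cum_share <- cumsum(pca4$sdev^2) / sum(pca4$sdev^2)
print(cum_share)
```

The second entry of cum_share is the quantity that clusplot reports as the point variability explained by the two components; raising or lowering the off-diagonal entries of Sigma moves it up or down.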

Red and green are hard for many people to tell apart. – Nick Cox, Apr 20 '17 at 11:19
