I'm running PCA on my dataset, which can be found here. It has 6497 instances and 12 attributes; the 13th column is the class (wine quality, ranging from 3 to 9).

I've read what PCA is supposed to do: it combines attributes into components and shows how much of the variance in the data each component captures. That should make it possible to reduce high-dimensional data to the few dimensions that carry the most variance.

I'm having trouble deciding which attributes to pick after running PCA. How can I find out which attributes I can keep without losing the meaning of my data?

I've created the following biplot with the ggbiplot package:

[Image: Biplot for Wine Quality (PC1, PC2)]

How can I tell from this biplot which attributes to pick?

I've also created a scree plot of the variance:

[Image: scree plot of the variance]

Looking at the scree plot, I would say that 3 components represent the data well. Is this a reasonable assessment?
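
For reference, the scree plot can also be read as numbers. The following is a minimal sketch, not taken from the original post: it uses the built-in mtcars data as a stand-in for the wine data, and the 80% threshold is an arbitrary assumption rather than a recommended cutoff.

```r
# Stand-in data (assumption): built-in mtcars instead of the wine data.
pca <- prcomp(mtcars, scale. = TRUE)

# The variance of each component is its squared standard deviation.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches the
# threshold (0.80 is an arbitrary illustrative choice).
n_keep <- which(cum_var >= 0.80)[1]

print(round(cum_var, 3))
print(n_keep)
```

The same cumulative proportions are reported by summary() on a prcomp object, so the check can be done on wine1.pca directly.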

Questions

  • How can I decide which attributes to pick after running PCA?
  • While doing PCA, should I leave out the class variable or keep it?
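
One way to look at both questions is through the loadings that prcomp returns in its rotation element. The sketch below is only an illustration under stated assumptions: it uses the built-in iris data as a stand-in, with Species playing the role of the class column, and ranking variables by their largest squared loading is a simple heuristic, not an established feature-selection method.

```r
# Stand-in data (assumption): iris, with Species acting as the class column.
features <- iris[, 1:4]   # the class column is left out of the PCA
pca <- prcomp(features, scale. = TRUE)

# Loadings: one row per original variable, one column per component.
loadings <- pca$rotation[, 1:2]

# Squared loadings within a component sum to 1, so they can be read as
# each variable's share of that component.
contrib <- loadings^2
print(round(contrib, 3))

# Heuristic ranking: largest share across the retained components.
ranking <- sort(apply(contrib, 1, max), decreasing = TRUE)
print(names(ranking))
```

Here the class column is excluded so that the components describe only the predictors; the class can still be used afterwards to color or group the scores.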

Code to generate the plots

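wine <- read.csv("wine_nocolor.csv")
wine1 <- wine[2:13]  # columns 2 to 13
# PCA on standardized variables
wine1.pca <- prcomp(wine1, scale. = TRUE)
# Install ggbiplot from GitHub (only needed once), then load it
library(devtools)
install_github("vqv/ggbiplot")
library(ggbiplot)
# Biplot of PC1 vs. PC2, grouped by wine quality
ggbiplot(wine1.pca, obs.scale = 1, var.scale = 10, groups = wine$q, ellipse = TRUE, circle = TRUE, title="Biplot for Wine Quality (PC1, PC2)", var.axes=TRUE, varname.size=8)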
  • What do the different variables in the dataset mean? What are you trying to achieve in the end (eg, after you've done everything w/ the PCA)? – gung - Reinstate Monica Apr 03 '15 at 18:20
  • @gung After the PCA I'm trying to do two things. First, to cluster the data using k-means. I've already used k-means on data that has not been PCA-processed, and I would like to compare the clustering results on non-PCA-processed data vs. PCA-processed data. Second, I'll use the components from PCA to classify the wine quality (there is a quality attribute in the dataset). – birdy Apr 03 '15 at 18:23
  • @gung The different variables in the dataset are chemical properties of the wine: total alcohol level, fixed acidity, pH level, etc. https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names – birdy Apr 03 '15 at 18:25
  • When you say "which attributes to pick", do you mean that you want to select a subset of your original variables (alcohol, acidity, ph, etc.)? – amoeba Apr 03 '15 at 19:44
  • @amoeba Yes, that's what I mean. After running the PCA I would like to know, based on the biplot and scree plot I generated, which attributes (alcohol, acidity, etc.) I should keep and which can be ignored. – birdy Apr 03 '15 at 19:45
  • If so, you are asking about *feature selection*. We have a good thread on using PCA for feature selection, take a look. – amoeba Apr 03 '15 at 19:47
  • Your analysis just showed that **two variables are on a much larger scale than the others**, IMHO. – Has QUIT--Anony-Mousse Apr 09 '15 at 09:07
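
The comparison described in the comments (k-means on the original variables vs. k-means on PCA scores) could be sketched roughly as follows. This is an assumption-laden illustration, not the asker's code: it uses iris as stand-in data, and the choices of 3 clusters and 2 components are arbitrary.

```r
set.seed(1)  # k-means uses random starts, so fix the seed

# Stand-in data (assumption): standardized iris measurements.
features <- scale(iris[, 1:4])

# Scores on the first two components (2 is an arbitrary choice).
pca <- prcomp(features)
scores <- pca$x[, 1:2]

# k-means on the full standardized data and on the PCA scores.
km_raw <- kmeans(features, centers = 3, nstart = 20)
km_pca <- kmeans(scores, centers = 3, nstart = 20)

# Cross-tabulate the two clusterings; cluster labels are arbitrary,
# so agreement shows up as one dominant cell per row.
agreement <- table(raw = km_raw$cluster, pca = km_pca$cluster)
print(agreement)
```

Standardizing before both steps keeps variables on larger scales from dominating either the components or the distances used by k-means.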