How to interpret this PCA biplot to determine which attributes to pick?

Question

I'm running PCA on my dataset, which can be found here. There are 6497 instances and 12 attributes; the 13th column is the class (wine quality, ranging from 3 to 9).

I've read about what PCA is supposed to do: it combines attributes into components and shows how much of the variance in the data each one captures, so that a few dimensions carrying most of the variance can be selected from high-dimensional data.

I'm having trouble working out which attributes to pick after running PCA. How can I find out which attributes I can use without losing the meaning of my data?

I've created the following biplot with the ggbiplot package:

[Image: biplot of PC1 vs. PC2 for the wine data, grouped by quality]

How can I interpret from this biplot which attributes to pick?

I've also created a scree plot of the component variances:

[Image: scree plot of the principal component variances]

Looking at the scree plot, I would say that 3 components represent the data well. Is this a reasonable assessment?
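One way to put numbers behind a reading of the scree plot is to compute the proportion of variance each component explains. The sketch below is only an illustration: it uses the built-in `USArrests` data as a stand-in for the wine data, and the 0.80 threshold is an arbitrary assumption, not a rule.

```r
# Stand-in data (assumption): USArrests replaces the wine data for illustration.
X <- USArrests

# Same call pattern as in the question: standardize columns, then run PCA.
pca <- prcomp(X, scale. = TRUE)

# The variance of each component is its squared standard deviation.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)

print(round(data.frame(PC = seq_along(var_explained),
                       proportion = var_explained,
                       cumulative = cum_explained), 3))

# Smallest number of components whose cumulative share reaches the threshold.
threshold <- 0.80  # arbitrary cutoff, chosen only for this sketch
n_keep <- which(cum_explained >= threshold)[1]
print(n_keep)
```

The same table is also available from `summary(pca)$importance`; the cutoff itself remains a judgment call.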

Questions

How can I decide which attributes to pick after running PCA?
While doing PCA, should I leave out the class variable or keep it?

Code to generate the plots

# One-time install of ggbiplot from GitHub (repository vqv/ggbiplot)
library(devtools)
install_github("vqv/ggbiplot")
library(ggbiplot)

wine <- read.csv("wine_nocolor.csv")
wine1 <- wine[2:13]  # columns 2 to 13 of the file
wine1.pca <- prcomp(wine1, scale. = TRUE)  # standardize each column before PCA
ggbiplot(wine1.pca, obs.scale = 1, var.scale = 10, groups = wine$q, ellipse = TRUE, circle = TRUE, title="Biplot for Wine Quality (PC1, PC2)", var.axes=TRUE, varname.size=8)
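The arrows in a biplot are drawn from the loadings, which `prcomp` stores in the `rotation` matrix, so they can also be inspected as numbers. The following is a minimal sketch, again using the built-in `USArrests` data as a stand-in for the wine data. Ranking variables by arrow length is only a heuristic for reading the plot, not an established feature-selection procedure, and ggbiplot rescales the arrows, so only relative lengths are comparable.

```r
# Stand-in data (assumption): USArrests replaces the wine data for illustration.
X <- USArrests
pca <- prcomp(X, scale. = TRUE)

# Loadings of each original variable on the first two components.
loadings <- pca$rotation[, 1:2]
print(round(loadings, 3))

# Length of each variable's arrow in the PC1-PC2 plane.
# A short arrow means the variable is poorly represented by these two components.
arrow_length <- sqrt(rowSums(loadings^2))
print(round(sort(arrow_length, decreasing = TRUE), 3))
```

Variables whose arrows point in similar directions are positively correlated in the plane of the first two components, which is often more informative than the lengths alone.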

What do the different variables in the dataset mean? What are you trying to achieve in the end (e.g., after you've done everything with the PCA)? — gung - Reinstate Monica, Apr 03 '15 at 18:20
@gung After the PCA I'm trying to do two things. First, to cluster the data using k-means. I've already used k-means on data that has not been PCA processed, and I would like to compare the clustering results on non-PCA processed data vs. PCA processed data. Second, I'll use the components from PCA to classify the wine quality (there is a quality attribute in the dataset). — birdy, Apr 03 '15 at 18:23
@gung The different variables in the dataset are chemical properties of the wine: total alcohol level, fixed acidity, pH level, etc. https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names — birdy, Apr 03 '15 at 18:25
When you say "which attributes to pick", do you mean that you want to select a subset of your original variables (alcohol, acidity, ph, etc.)? — amoeba, Apr 03 '15 at 19:44
@amoeba Yes, that's what I mean. After running the PCA, I would like to know, based on the biplot and scree plot I generated, which attributes (alcohol, acidity, etc.) I should keep and which can be ignored. — birdy, Apr 03 '15 at 19:45
If so, you are asking about *feature selection*. We have a good thread on using PCA for feature selection; take a look. — amoeba, Apr 03 '15 at 19:47
Your analysis just showed that **two variables are on a much larger scale than the others**, IMHO. — Has QUIT--Anony-Mousse, Apr 09 '15 at 09:07
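One way to check the scale concern raised in the last comment is to compare the column standard deviations and to see how the first component changes with and without standardization. This is a sketch under the assumption that the built-in `USArrests` data can stand in for the wine data; the variable names in the output belong to that stand-in, not to the wine dataset.

```r
# Stand-in data (assumption): USArrests replaces the wine data for illustration.
X <- USArrests

# Columns on very different scales show up as very different standard deviations.
col_sd <- sapply(X, sd)
print(round(col_sd, 2))

# PCA on raw columns vs. standardized columns.
pca_raw <- prcomp(X, scale. = FALSE)
pca_std <- prcomp(X, scale. = TRUE)

# Absolute PC1 loadings: without scaling, the largest-variance column dominates.
print(round(abs(pca_raw$rotation[, 1]), 3))
print(round(abs(pca_std$rotation[, 1]), 3))

dominant_raw <- names(which.max(abs(pca_raw$rotation[, 1])))
print(dominant_raw)
```

With `scale. = TRUE`, as in the question's code, every column has unit variance before the decomposition, so differences in measurement units alone should not drive the first component.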
