
I ran principal components analysis in R on my data. All my regressors are continuous, non-categorical variables, except gender, which I excluded. I will add it back and compare model 1 (PCA) with model 2 (PCA + gender) to see whether it is significant.

I found that PC1, PC2 and PC3 together explain 95% of the variance, so I only consider the first 3 columns. All the variables have non-zero values and appear in the order in which I entered them into the matrix of regressors.

I used a spree plot to determine that I need only 3 regressors to explain most of the variance (which is rather amazing given I have 30+ variables!).
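
For reference, the steps described above (PCA, cumulative share of variance, and the plot of component variances) can be sketched in base R as follows. This is only a minimal sketch on simulated data: the matrix `X`, the three hidden factors and the mixing weights `W` are invented stand-ins, not the data from this question, and the 95% threshold is simply the figure quoted above.

```r
# Simulated stand-in for the real data: 30 correlated continuous regressors
set.seed(1)
n <- 200
latent <- matrix(rnorm(n * 3), n, 3)        # three hidden factors (assumption)
W <- matrix(rnorm(3 * 30), 3, 30)           # hypothetical mixing weights
X <- latent %*% W + matrix(rnorm(n * 30, sd = 0.3), n, 30)
colnames(X) <- paste0("x", 1:30)

# PCA on centred, standardised variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance per component and its running total
var_prop <- pca$sdev^2 / sum(pca$sdev^2)
cum_prop <- cumsum(var_prop)

# Smallest number of components whose cumulative share reaches 95%
k <- which(cum_prop >= 0.95)[1]

# Plot of component variances against component number
screeplot(pca, npcs = 15, type = "lines")
```

Note that `k` counts components, each of which is a weighted combination of all 30 columns, rather than counting original variables.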

I am confused about how to determine which variables I should choose. Do I order the variables within each PC column from highest to lowest and see which variables contribute the most? I saw in the Freeway R guide that they actually plot the graphs and compare them: for similar-looking graphs they take one variable, and for different-looking graphs they subtract the variables from each other. This was rather confusing to me.
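
The ordering idea in the paragraph above could be sketched like this in base R, again on made-up data (the six columns `x1` to `x6` and the induced correlation between `x1` and `x2` are assumptions for illustration). It ranks the variables by absolute loading within each retained component. Such a ranking describes what each component is made of; on its own it is not necessarily a valid way to pick regressors.

```r
set.seed(2)
X <- matrix(rnorm(100 * 6), 100, 6)
colnames(X) <- paste0("x", 1:6)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.2)   # make x1 and x2 strongly correlated

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Loadings: one row per original variable, one column per component
loadings <- pca$rotation[, 1:3]

# For each retained component, variable names ordered by absolute loading
ranked <- sapply(colnames(loadings), function(pc) {
  v <- abs(loadings[, pc])
  names(sort(v, decreasing = TRUE))
})
ranked
```

Each column of `ranked` lists the variables from largest to smallest absolute loading for that component.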

Questions:

  1. Is the ANOVA approach (comparing model 1 with model 2) correct for testing gender in this case?
  2. Spree plot: do I look for the first 'kink', or for the point where the curve first reaches 0 along the x axis (number of variables)? Mine drops dramatically after 3 variables (to about 0.5) and then curves down to 0 at variable 15. Different websites say different things, hence I'm confused.
  3. Is the approach above correct to write a linear model? Do I write it as $y={\rm variable}_1 + {\rm variable}_2 + {\rm variable}_3$ or do I multiply them or both?
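
To make questions 1 and 3 concrete, here is a hedged sketch of the two models in base R. The data are simulated (the ten regressors, the `gender` factor and the response `y` are all invented), and the regressors in the model are the component scores `PC1` to `PC3` rather than original variables. In R formula syntax `+` gives an additive model, whereas `*` would also add interaction terms.

```r
set.seed(3)
n <- 150
X <- matrix(rnorm(n * 10), n, 10)                      # hypothetical regressors
gender <- factor(sample(c("F", "M"), n, replace = TRUE))
y <- X[, 1] - 0.5 * X[, 2] + 0.8 * (gender == "M") + rnorm(n)

# PCA uses only the regressors, never y
pca <- prcomp(X, center = TRUE, scale. = TRUE)
dat <- data.frame(y = y, pca$x[, 1:3], gender = gender)

# Model 1: additive in the first three component scores
m1 <- lm(y ~ PC1 + PC2 + PC3, data = dat)

# Model 2: the same terms plus gender
m2 <- lm(y ~ PC1 + PC2 + PC3 + gender, data = dat)

# F-test comparing the nested pair of models
cmp <- anova(m1, m2)
cmp
```

The second row of `cmp` reports the F statistic and p-value for adding gender to the component-only model.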
gung - Reinstate Monica
Dino Abraham
  • It's "scree plot". See a related post [here](http://stats.stackexchange.com/questions/87198/pca-randomness-of-component). – Scortchi - Reinstate Monica Mar 06 '14 at 12:53
  • For more comment on the word "scree" than is typical in statistical contexts, see http://www.stata.com/manuals13/mvscreeplot.pdf (scroll to the box on p.6, which I ghost wrote). – Nick Cox Mar 06 '14 at 13:02
  • Your motive in using PCA appears to be to guide selection of regressors. Fair enough, but that's what the lesser known "principal variables analysis" is all about. See e.g. http://www.sciencedirect.com/science/article/pii/S0167947307000564 – Nick Cox Mar 06 '14 at 13:06
  • See also http://stats.stackexchange.com/questions/23863/use-of-pca-analysis-to-select-variables-for-a-regression-analysis – Nick Cox Mar 06 '14 at 13:08
  • To add to Nick, the real benefits of incomplete principal components regression comes from using the PC loadings in the final model. There are many ways, however, to try to approximate the PCs so that approximate PCs can be used in the final model. But avoid anything that uses $Y$. – Frank Harrell Mar 06 '14 at 13:13
  • @Frank Harrell When you say loadings, do you mean scores? – Nick Cox Mar 06 '14 at 13:17
  • Yes - sorry - the linear combination weighted by the loadings. – Frank Harrell Mar 06 '14 at 13:30
  • @FrankHarrell: "avoid anything that uses $Y$": could you expand a bit on that? As it is I'd read it including "do not use PLS" (which is the PCA-analogon I use if I intend to use $Y$ information). – cbeleites unhappy with SX Mar 06 '14 at 14:55
  • Thanks everyone! :) I did PCA but I learnt you can't actually get a 'model' of regressors, i.e. each PC becomes a factor, which complicates things. Therefore, other than subsets and stepwise, are there other methods that can help me select variables? Especially ones that take into account multicollinearity and/or interaction terms. – Dino Abraham Mar 07 '14 at 18:31
