I ran principal components analysis (PCA) in R on my data. All my regressors are continuous, non-categorical variables, except gender, which I excluded. I plan to add it back and compare model 1 (principal components only) with model 2 (principal components + gender) using ANOVA to see whether gender is significant.
I determined that PC1, PC2 and PC3 explain 95% of the variance, so I only consider the first 3 columns. Within each PC column, all the variables have non-zero values, and they appear in the order in which I entered them into the matrix of regressors.
I used a scree plot to determine that I need only 3 regressors to explain most of the variance (which is rather amazing given I have 30+ variables!).
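To make this concrete, here is a minimal sketch of the kind of steps I mean, run on simulated data rather than my real data. The names `X`, `pca`, `prop_var`, `cum_var` and `k` are placeholders I made up for illustration, and the simulated numbers will not match the 95% figure above.

```r
set.seed(1)

# Simulated stand-in for my data: 100 rows, 30 correlated continuous regressors
n <- 100
latent <- matrix(rnorm(n * 3), n, 3)
W <- matrix(rnorm(3 * 30), 3, 30)
X <- latent %*% W + matrix(rnorm(n * 30, sd = 0.3), n, 30)
colnames(X) <- paste0("v", 1:30)

# PCA on centred and scaled regressors
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance per component, and the running total
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(prop_var)
print(round(cum_var[1:5], 3))

# Scree plot: variance of each component against component number
screeplot(pca, npcs = 15, type = "lines")

# Smallest number of components whose cumulative share reaches 95%
k <- which(cum_var >= 0.95)[1]
print(k)
```

The cumulative share (`cum_var`) and the shape of the scree plot are two different criteria, which may be part of why I am unsure which cut-off to use.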
I am confused about how to determine which variables I should choose. Do I order the variables within each PC column from highest to lowest and pick the ones contributing the most? I saw in the Freeway R guide that they actually plot the graphs and compare them: where the graphs look similar they take one variable, and where the graphs look different they subtract the variables from each other. This was rather confusing to me.
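By "ordering the variables per PC column" I mean something like the sketch below, again on simulated data. I am assuming the columns I am looking at are the ones in `pca$rotation` from `prcomp`; `X`, `load1` and `ranked` are placeholder names, and ranking by absolute value is my own guess at what "contributing the most" should mean.

```r
set.seed(1)

# Simulated stand-in for my matrix of regressors (30 continuous variables)
n <- 100
X <- matrix(rnorm(n * 30), n, 30)
colnames(X) <- paste0("v", 1:30)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Column of values for PC1: one entry per original variable
load1 <- pca$rotation[, "PC1"]

# Rank variables by the absolute size of their entry, largest first
ranked <- load1[order(abs(load1), decreasing = TRUE)]
print(round(head(ranked, 5), 3))

# Plot the PC1 column against variable index, one point per variable
plot(load1, type = "b", xlab = "variable index", ylab = "PC1 value")
```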
Questions:
- Is the ANOVA approach (comparing the model with and without gender) correct for this variable?
- Scree plot: do I look for the first 'kink', or for the point where the curve first reaches 0 along the x axis (number of variables)? Mine drops sharply after 3 variables (to about 0.5) and then curves down to 0 at variable 15. Different websites say different things, hence my confusion.
- Is the approach above a correct way to build a linear model? Do I write it as $y={\rm variable}_1 + {\rm variable}_2 + {\rm variable}_3$, or do I multiply the terms, or both?
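To make the first and last questions concrete, this is roughly what I have in mind, sketched on simulated data. The names `dat`, `m1`, `m2` and `m3` are placeholders, not my actual objects, and I am assuming the component scores come from `pca$x`. In an R formula, `+` adds terms and `*` adds the terms plus their interactions, so `m3` is my reading of the "multiply" version.

```r
set.seed(1)

# Simulated stand-in: 30 continuous regressors, a gender factor, a response
n <- 100
X <- matrix(rnorm(n * 30), n, 30)
gender <- factor(sample(c("F", "M"), n, replace = TRUE))
y <- X[, 1] - X[, 2] + rnorm(n)

# PCA on the continuous regressors only (gender excluded)
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Keep the scores of the first 3 components as the new regressors
dat <- data.frame(y = y, pca$x[, 1:3], gender = gender)

# Model 1: additive in the 3 components
m1 <- lm(y ~ PC1 + PC2 + PC3, data = dat)

# Model 2: model 1 plus gender
m2 <- lm(y ~ PC1 + PC2 + PC3 + gender, data = dat)

# "Multiply" version: main effects plus all interactions
m3 <- lm(y ~ PC1 * PC2 * PC3, data = dat)

# F-test comparing the nested models 1 and 2
print(anova(m1, m2))
```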