
I am hoping someone can check this code to ensure that I have interpreted the various pieces of PCA correctly. I am trying to figure out a way to identify the leading contributors to the performance of multiple securities. For example, one idea I had was to run a multivariate regression using the securities' returns as dependent variables, with explanatory variables like oil, the dollar, the euro, Treasury yields, etc. E.g., SBUX + AAPL + MCD + BAC + TWTR = intercept + oil + dollar + euro + steel + gold + e
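
A rough sketch of that idea in R, with simulated series standing in for the real returns and factors (all names and numbers below are placeholders):

#Simulated daily returns for a few securities and candidate macro factors
set.seed(1)
rets <- cbind(SBUX = rnorm(30), AAPL = rnorm(30), MCD = rnorm(30))
macro <- data.frame(oil = rnorm(30), dollar = rnorm(30), euro = rnorm(30))

#lm() fits one regression per column when the response is a matrix
fit <- lm(rets ~ oil + dollar + euro, data = macro)
coef(fit)  #one column of coefficients per security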

I then thought that PCA would probably be better suited for this type of exercise. Here is my code in R. The CSV file is a matrix with 900 columns (securities) and 30 rows (30 daily returns for each of the 900 securities).

FD <- read.csv("U:/Personal Projects/R/Data Files/FD Securities Jan 2015.csv")

#Removes columns with any NA values
FD1 <- FD[, sapply(FD, function(x) !any(is.na(x)))]

#removes "zero/constant variance" columns, which I think are NaNs I couldnt erase using is.nan
FD2 <- FD1[,apply(FD1, 2, var, na.rm=TRUE) != 0]

#Calculate PCs using singular value decomposition (prcomp). Should the data be a correlation matrix of the variables? I get reasonable-looking results both ways, i.e., PC1 explains 30-50% of variance, PC2 ~10%-15%, etc.
FD2.pca <- prcomp(cor(FD2), retx = TRUE, scale = TRUE)
summary(FD2.pca)
plot(FD2.pca)

#These are the 'loadings', i.e., coefficients used for each linear combination?
as.matrix(FD2.pca$rotation[,1])

#I think these "scores" are the coefficients of interest, as they incorporate the factor weightings, because the output is pca$rotation * scale (stddev of each factor)
as.matrix(FD2.pca$x[,1])
as.matrix(FD2.pca$x[,2])
as.matrix(FD2.pca$x[,3])
as.matrix(FD2.pca$x[,4])

#Scatterplot of the first two principal components. Not sure if this is right or if $x should be used.
plot(x = FD2.pca$rotation[,1], FD2.pca$rotation[,2], xlab = "PC1", ylab = "PC2", main="Principal Component Analysis:")
user2662565

1 Answer


No, you do not supply a correlation matrix to prcomp(). Instead, you supply the data themselves, and if you want to do PCA on the correlation matrix you also pass scale = TRUE, which (together with the default centring) standardises the variables to zero mean and unit variance.
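
A minimal sketch using the objects from the question (note the formal argument name is scale., with a trailing period, as flagged in the comments below):

#PCA of the data matrix itself; scaling each column makes this
#equivalent to PCA of the correlation matrix
FD2.pca <- prcomp(FD2, retx = TRUE, scale. = TRUE)
summary(FD2.pca)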

To draw the data on the first two PCs, you were right to query the use of $rotation; $x should be used for this.

Yes, the "loadings" are the $rotation component.

Gavin Simpson
  • If I take cor() out, the result is 21 PCs, but I thought I was supposed to have as many PCs as I have dimensions (900). I get 900 PCs when I use cor(FD2) – user2662565 Feb 04 '15 at 19:22
  • Passing in a correlation (or covariance) matrix to `prcomp` is just wrong. `prcomp` uses a singular value decomposition of the data matrix. Another way to do PCA is via an Eigen decomposition of a covariance or correlation matrix; that is what `princomp` does. The SVD of the actual data matrix is preferred. Note that `princomp` won't work in your case, as you have more columns (variables) than rows (observations); `prcomp` is fine with such data, however. What you are doing to get 900 PCs is *not* PCA of your data; you're doing a PCA of the correlation matrix, not the actual data. – Gavin Simpson Feb 04 '15 at 19:30
  • Ok, so I have 21 rows (dates) and 900 columns (variables), so the matrix is not square, since 900 columns > 21 rows. If I create a correlation matrix to get 900 x 900, this is how I have been able to "successfully" run the PCA, but I get 900 PCs. How am I supposed to approach this? – user2662565 Feb 04 '15 at 19:50
  • You haven't "successfully" done PCA of the data at all. I'm not sure exactly what you have achieved by doing that, but I do know it is not what you want. You can't do the Eigen decomposition method for PCA because `princomp` doesn't handle that, as your data are wide. If you want to do PCA of your data, but make each variable contribute the same (as it would if you could do this using an Eigen decomposition of the correlation matrix), then do as I said in the Answer: add `scale = TRUE` to the `prcomp()` call and you are good to go. – Gavin Simpson Feb 04 '15 at 19:59
  • I have done this and receive 21 PCs. My understanding is that I should have as many PCs as I have variables (900). When I run prcomp using the correlation matrix, I get 900 PCs, where the first explains 70% of the variance, and based on the raw coefficients I observed for each variable, it seemed to tell the story that I felt it should tell, which is why I wanted to confirm that it was run properly and that I wasn't attaching my own interpretation to the results. When I run with scale = T I get 21 PCs. Why do I only get 21? – user2662565 Feb 04 '15 at 20:02
  • Additionally, how do I interpret the loadings vs. the scores? The largest loading is 0.05, and the smallest is -0.01. Is it correct to use the scores when using the results for prediction? – user2662565 Feb 04 '15 at 20:07
  • @user2662565: See here as to why you get 21 PCs (which is what you should get!): [Why are there only $n−1$ principal components for $n$ data points if the number of dimensions is larger than $n$?](http://stats.stackexchange.com/questions/123318) – amoeba Feb 05 '15 at 23:22
  • Note for anyone visiting that the argument in `prcomp` is `scale.` not `scale` (period at end). Tried to edit the post but a one character change isn't accepted. – Bryan Hanson Mar 16 '19 at 00:25
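
A small simulation illustrating the point raised in the comments, using random data of the same shape described there (21 rows, 900 columns; the values are placeholders):

#With n = 21 rows and p = 900 columns, prcomp() returns min(n, p) = 21 PCs;
#centring removes one dimension, so only 20 carry non-negligible variance
set.seed(42)
X <- matrix(rnorm(21 * 900), nrow = 21)
pc <- prcomp(X, scale. = TRUE)
ncol(pc$x)           #21
sum(pc$sdev > 1e-8)  #20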