I am hoping someone can check this code to make sure I have interpreted the various pieces of PCA correctly. I am trying to figure out a way to identify the leading contributors to the performance of multiple securities. For example, one idea I had was to run a multivariate regression with the securities' returns as the dependent variables and things like oil, the dollar, the euro, Treasury yields, etc. as the regressors. E.g., SBUX + AAPL + MCD + BAC + TWTR = intercept + oil + dollar + euro + steel + gold + e
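As a rough sketch of that regression idea (assuming a data frame of aligned daily returns; the object name ret and the column names here are just placeholders, not my actual data):

#Hypothetical multivariate regression: each security's return regressed on the same macro factors
fit <- lm(cbind(SBUX, AAPL, MCD, BAC, TWTR) ~ oil + dollar + euro + steel + gold, data = ret)
summary(fit)  #one coefficient table per security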
I then thought that PCA would probably be better suited to this type of exercise. Here is my R code. The CSV file is a matrix with 900 columns (one per security) and 30 rows (30 daily returns for each of the 900 securities).
FD <- read.csv("U:/Personal Projects/R/Data Files/FD Securities Jan 2015.csv")
#Remove columns with any NA values
FD1 <- FD[, sapply(FD, function(x) !any(is.na(x)))]
#Remove zero/constant-variance columns, which I think are NaNs I couldn't drop using is.nan
FD2 <- FD1[, apply(FD1, 2, var, na.rm = TRUE) != 0]
#Calculate the PCs using singular value decomposition (prcomp). Should the input be a correlation matrix of the variables, or the data themselves? I get reasonable-looking results both ways, i.e., PC1 explains 30-50% of the variance, PC2 ~10-15%, etc.
FD2.pca <- prcomp(cor(FD2), retx = TRUE, scale. = TRUE)
summary(FD2.pca)
plot(FD2.pca)
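For comparison, this is the other version I tried: running prcomp on the return matrix itself and letting scale. = TRUE standardize each column, which I believe is the more common usage (the cor() version above is the one I'm unsure about; I call the result FD2.pca.alt just to keep the two separate):

#Alternative: PCA on the returns directly, standardizing each security
FD2.pca.alt <- prcomp(FD2, retx = TRUE, center = TRUE, scale. = TRUE)
summary(FD2.pca.alt)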
#These are the 'loadings', i.e., coefficients used for each linear combination?
as.matrix(FD2.pca$rotation[,1])
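For instance, to see which securities load most heavily on the first component (which is really what I'm after with the "leading contributors" question), I could presumably do something like:

#Securities with the largest absolute loadings on PC1 (my guess at the 'leading contributors')
pc1.load <- FD2.pca$rotation[, 1]
head(sort(abs(pc1.load), decreasing = TRUE), 10)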
#I think these "scores" are the coefficients of interest, as they incorporate the factor weightings, because the output is pca$rotation multiplied by the scale (the std dev of each factor)
as.matrix(FD2.pca$x[,1])
as.matrix(FD2.pca$x[,2])
as.matrix(FD2.pca$x[,3])
as.matrix(FD2.pca$x[,4])
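To pin this down, I would check whether the scores are just the centered/scaled input multiplied by the rotation matrix, which I think is what prcomp computes internally (here the "input" is cor(FD2), since that is what I handed to prcomp):

#Sanity check: scores = scaled input %*% rotation
chk <- scale(cor(FD2), center = FD2.pca$center, scale = FD2.pca$scale) %*% FD2.pca$rotation
all.equal(unname(chk), unname(FD2.pca$x))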
#Scatterplot of the first two principal components. Not sure if this is right or if $x (the scores) should be used instead.
plot(FD2.pca$rotation[, 1], FD2.pca$rotation[, 2], xlab = "PC1", ylab = "PC2", main = "Principal Component Analysis: loadings")
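And here is the version using the scores instead, plus biplot(), which I believe overlays the scores and loadings on one chart:

#Alternative scatterplot using the scores ($x) rather than the loadings
plot(FD2.pca$x[, 1], FD2.pca$x[, 2], xlab = "PC1", ylab = "PC2", main = "Principal Component Analysis: scores")
#biplot shows scores and loadings together
biplot(FD2.pca)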