I have a large data matrix that I'm trying to reduce to a reasonably sized basis set. The original matrix is 916x225, and I need to reduce the number of variables (its columns) to around 50, but I want to select those that are the most representative of the complete matrix.
Specifically, I want to find a subset S of, say, 50 variables that leaves the least unexplained variance when all the other variables are regressed on S ("most representative").
My current approach is to perform PCA (prcomp in R) and pick out the original columns most associated with each principal component. I assume that the variable with the largest absolute loading on a PC (i.e., the largest absolute value in that PC's column of the rotation matrix) is the most representative of, or most correlated with, that PC.
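To illustrate the idea (in Python rather than R, with random stand-in data): numpy's SVD of the centered matrix gives the same rotation matrix that prcomp returns, and `picked` is just an illustrative name for the per-PC selection.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((916, 225))  # random stand-in for the real matrix
Xc = X - X.mean(axis=0)              # center columns, as prcomp does by default

# rows of Vt (columns of V) are the principal directions -- prcomp's rotation matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
rotation = Vt.T                      # shape (225, 225): rows = variables, cols = PCs

k = 50
# for each of the first k PCs, the index of the variable with the largest |loading|
picked = np.abs(rotation[:, :k]).argmax(axis=0)
# note: nothing stops the same variable from being picked for two different PCs
```

One wrinkle this makes visible: the same variable can have the largest absolute loading on more than one PC, so this rule does not guarantee 50 distinct variables.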
Am I interpreting this correctly? If not, any additional guidance is appreciated.
Update: From the comments below, I want to add a clarifying point to help focus the discussion on my intent. I apologize for not conveying it well in the original question.
Essentially, I'm looking for a subset S of size, say, L = 50 variables that leaves the least unexplained variance when the remaining variables are regressed on S ("most representative"). My hope was that PCA would tell me how many PCs are needed to explain, say, 90% of the variance, and that I could then choose the variables most correlated with each of those PCs.
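To pin the criterion down, here is a sketch of the unexplained-variance objective in Python (random stand-in data; `unexplained_variance` is an illustrative name, not an existing library function):

```python
import numpy as np

def unexplained_variance(X, S):
    """Total residual variance left when every column of X not in S
    is regressed, by least squares, on the columns in S.
    Assumes the columns of X have already been centered."""
    S = list(S)
    rest = [j for j in range(X.shape[1]) if j not in S]
    Xs = X[:, S]
    # one joint least-squares fit of all remaining columns on the subset
    beta, *_ = np.linalg.lstsq(Xs, X[:, rest], rcond=None)
    resid = X[:, rest] - Xs @ beta
    return resid.var(axis=0).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((916, 225))
X -= X.mean(axis=0)
score = unexplained_variance(X, range(50))  # e.g. scoring the first 50 columns as S
```

Any candidate subset, however it was chosen (PCA loadings, greedy search, etc.), can be scored with this one function, which makes different selection rules directly comparable.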
I thought of a brute-force search too, but haven't tried it: with 225 variables, 225 choose 50 is on the order of 10^50 candidate subsets, so fitting all of those linear models would take far too long.
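The count itself is easy to check in Python:

```python
import math

# number of distinct 50-variable subsets of 225 variables
n_subsets = math.comb(225, 50)
print(math.log10(n_subsets))  # a bit over 50, i.e. on the order of 10^50 models
```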