
I have a large data matrix that I'm trying to reduce to a reasonably sized basis set. The original matrix is 916x225, and I need to reduce the number of variables (its columns) to around 50, but I want to select those that are the most representative of the complete matrix.

Specifically, I want to find a subset S of, say, 50 variables that leaves the least unexplained variance when all of the other variables are regressed on S ("most representative").

My current approach is to perform PCA (prcomp in R) and to find the individual columns that are most associated with each principal component. I assume that the original variable with the largest absolute value for its loading (i.e., the largest absolute value in that PC's column of the rotation matrix) is thus most representative of, or most correlated with, that PC.
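
To make that concrete, here is a rough sketch of what I'm doing; the matrix `X` below is just a random stand-in for my real 916x225 data, only to show the mechanics:

```r
# Random stand-in for my 916x225 data matrix (illustration only)
set.seed(1)
X <- matrix(rnorm(916 * 225), nrow = 916, ncol = 225)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# How many PCs are needed to reach ~90% of the total variance?
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pc <- which(cum_var >= 0.90)[1]

# For each of those PCs, take the variable with the largest absolute entry
# in the rotation (eigenvector) matrix
picked <- apply(abs(pca$rotation[, 1:n_pc, drop = FALSE]), 2, which.max)
selected_vars <- unique(picked)
```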

Am I interpreting this correctly? If not, any additional guidance is appreciated.

Update: From the comments below, I wanted to add this clarifying point in order to help focus any discussion on my intent. I apologize that I did not convey it well in the original question.

Essentially, I'm looking for a subset S of, say, L=50 variables that leaves the least unexplained variance when the other variables are regressed on S ("most representative"). My hope was that by using PCA I could find how many PCs are needed to explain, say, 90% of the variance, and then choose the variables that are most correlated with each of those PCs.

I thought of a brute-force search, too, but haven't tried it since I have 225 variables in my original matrix, and 225 choose 50 comes to about 3×10^50. It would take a very long time to compute all those linear models.
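
To make the optimization target concrete, this is roughly how I would score a single candidate subset S (again with a random placeholder for my data); the problem is that a brute-force search would have to repeat this for every one of those ~3×10^50 subsets:

```r
# Total residual (unexplained) variance when every variable NOT in S
# is regressed on the variables in S.  X is a random stand-in for my data.
set.seed(1)
X <- matrix(rnorm(916 * 225), nrow = 916, ncol = 225)

unexplained_variance <- function(X, S) {
  others <- setdiff(seq_len(ncol(X)), S)
  Xs <- cbind(1, X[, S, drop = FALSE])   # design matrix: intercept + subset S
  total <- 0
  for (j in others) {
    fit <- lm.fit(Xs, X[, j])            # regress variable j on S
    total <- total + sum(fit$residuals^2)
  }
  total
}

# Score one arbitrary subset of 50 columns
S <- sample(ncol(X), 50)
unexplained_variance(X, S)
```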

KirkD_CO
  • You (along with `prcomp` in R) incorrectly apply the word "[loading](http://stats.stackexchange.com/q/143905/3277)" to the eigenvector (rotation) matrix. As for your question about an association, it is answered [here](http://stats.stackexchange.com/q/119746/3277). – ttnphns Sep 16 '16 at 03:36
  • In prcomp, the return value provides a matrix called 'rotation'. Is that not a loadings matrix, but rather an eigenvalue matrix? – KirkD_CO Sep 16 '16 at 04:18
  • Yes, it is so... – ttnphns Sep 16 '16 at 04:39
  • Ah, the silly help text is wrong! – KirkD_CO Sep 16 '16 at 04:57
  • Only you misprinted: eigenvector matrix (not eigenvalue matrix) – ttnphns Sep 16 '16 at 05:02
  • Hmm, I might misunderstand the question. But I think you are looking (combinatorially) for a subset S of size, say, L=50 variables from all, which leaves the least unexplained variance in a regression of the other variables on S *("most representative")*? If I got this correctly, then a brute-force check of all combinations of subsets should be "sufficient". I do not see immediately how a PCA could be used to reduce the effort of the brute-force solution, but perhaps this is possible - before thinking more deeply about it, I'd like to know whether I got your optimization target correctly? – Gottfried Helms Sep 20 '16 at 14:23
  • Yes, you have the interpretation exactly: a subset S of size, say, L=50 variables from all, which leaves the least unexplained variance in a regression of the other variables on S ("most representative"). My hope was that by using PCA I could find how many PCs are needed for, say, 90% of the variance, and then choose the variables that are most correlated with each PC. I thought of brute force, too, but haven't tried it since I have 225 variables in my original matrix, and 225 choose 50 comes to about 3×10^50. It would take a very long time to compute all those linear models. – KirkD_CO Sep 20 '16 at 15:12
  • Hmm, trying to find some shortcut to reduce the full combinatorial search, I looked at the determinants. It seems that the separation of the full set S of all variables into subsets W (wanted) and U (unwanted), such that most of the variance in the correlation matrix (defined by S) is explained by the submatrix defined by W, occurs where the sum of the determinants of the two sub-correlation matrices (defined by the W and U subsets) is also maximal. (Well, this does not yet help much, but I'm looking for further such observations in the hope of finding shortcuts.) – Gottfried Helms Sep 20 '16 at 23:40
  • One possibility would be to do a greedy forward selection search - find the one variable that best reproduces the rest and add it to W. Then, of the remaining, find the one that in combination with the first best reproduces the rest, and add it to W. Continue adding to W stepwise (a rough sketch of this appears after the comments). I typically don't like the greedy search approach, but it isn't likely that there are significant non-linearities to worry about. – KirkD_CO Sep 21 '16 at 00:05
  • With that form of greedy algorithm we might run into local minima that are not the global minimum; with this idea in mind I've already looked at a synthetic example with #S=7 variables, #W=3 and #U=4 (where "#" means "number of variables in ...") that also has 3 significant PCs. – Gottfried Helms Sep 21 '16 at 00:52
  • Yes, that was exactly my concern. Would be fast, but very prone to local minima. – KirkD_CO Sep 21 '16 at 02:58
  • One possible approximation to a good result is to do PCA with quartimax rotation (or promax) on the leading PCs. On each quartimax/promax factor one then has a cloud of variables which load (absolutely) high; take from each cloud the one which is "sharpest" across the factors. Take this (and possibly some variations) as initial solutions and look for the best of those initials. I'm sure there is a lot of literature on this, but I didn't come up with a good search phrase, although it is a standard task in scale generation/analysis. – Gottfried Helms Sep 21 '16 at 05:53
  • P.s. I don't think this is a duplicate of the other question, which might be more obvious if the title of the question were made more focused... – Gottfried Helms Sep 21 '16 at 05:54
  • Perhaps `feature-selection` is a tag with better related questions/answers (I added this tag to the question's tags). – Gottfried Helms Sep 21 '16 at 08:13
  • Another link that might be useful: http://www.cfe-csda.org/erricos/Papers/CSDA07.pdf, "Efficient algorithms for computing the best subset regression models for large-scale problems". – Gottfried Helms Sep 21 '16 at 13:05
  • @GottfriedHelms, I feel that your edit of the title was a bit radical, so I just brought it slightly back to keep the word "association". It is true that variable selection is implied here, but variable selection is a somewhat wider topic. Will you mind? – ttnphns Sep 22 '16 at 08:59
  • @ttnphns Based on the comments (I have now upvoted two comments, one by Gottfried and one by OP, that seem most relevant), it seems that Gottfried's version of the title was closer to what the OP is really after. – amoeba Sep 22 '16 at 12:52
  • In any case, I have voted to reopen (cc @GottfriedHelms). – amoeba Sep 22 '16 at 12:53
  • Thanks @amoeba - for the reopen vote as well as for the rating of the title proposal. On the other hand, this is now becoming too much hassle for me for this small problem (and my "spurious" engagement in SSE anyway). If I have something new of value, I'll post it at the OP's question. – Gottfried Helms Sep 22 '16 at 13:25
  • As you like, gentlemen, I don't object. – ttnphns Sep 22 '16 at 13:52
  • @ttnphns. I leave it to the OP to edit this post further if they have interest in it. – amoeba Sep 22 '16 at 17:57
  • @amoeba - certainly interest. I appreciate reopening the post. – KirkD_CO Sep 22 '16 at 19:44
  • @GottfriedHelms - Thank you for the great discussion and the references. This has been tremendously helpful! – KirkD_CO Sep 22 '16 at 19:44
  • KirkDCO - you're welcome! If I find out more, I'll let you know. Btw. - if there is still some room for improvement of the title, consider making that edit, to turn this question into a well-titled, helpful resource for later readers (for instance, you might look at my proposal using the *revision history*, which opens when the *"edited Sep ..."* entry below your question is clicked). – Gottfried Helms Sep 22 '16 at 19:55
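
A minimal sketch of the greedy forward selection discussed in the comments above; it reuses the `unexplained_variance()` helper and the stand-in matrix `X` sketched in the question, and is only an illustration (as noted in the comments, a greedy search can get stuck in local optima):

```r
# Greedy forward selection: at each step add the variable that most reduces
# the total unexplained variance of the remaining variables.
# Reuses unexplained_variance() and X from the sketches in the question.
greedy_select <- function(X, n_keep = 50) {
  selected  <- integer(0)
  remaining <- seq_len(ncol(X))
  for (step in seq_len(n_keep)) {
    scores <- vapply(remaining,
                     function(j) unexplained_variance(X, c(selected, j)),
                     numeric(1))
    best      <- remaining[which.min(scores)]
    selected  <- c(selected, best)
    remaining <- setdiff(remaining, best)
  }
  selected
}

# Roughly 225 + 224 + ... candidate evaluations instead of 225-choose-50 subsets
# S_greedy <- greedy_select(X, n_keep = 50)   # slow, but feasible unlike brute force
```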

1 Answer


You're interpreting PCA incorrectly, I believe. The output of PCA is an orthogonal basis for your data, where the first principal component accounts for the most variance, the second PC the second most, and so on.

However, I'll take 'finding the column vectors most correlated with each PC' to mean 'I want to reduce my dimensionality/number of features'. Selecting the variables most correlated with the PCs will not do this, but there is a way: you can transform your observation matrix into the new basis found by your PCA, which achieves just that. This is done by $T_{L} = X~W_{L}$, where $X$ is your observation matrix, $W_{L}$ is the rotation (eigenvector) matrix restricted to the first $L$ components, $T_{L}$ is the transformed matrix, and $L$ is the number of principal components you want to keep. If you let $L=225$ (the maximum $L$; there cannot be more PCs than the original number of features), then nothing has changed. But if you let $L=50$, you are effectively projecting your data onto a 50-dimensional hyperplane in the new principal-component feature space. This is key: you are not SELECTING 50 features, but projecting your data onto the 50 most important PCs. Your features will therefore differ from the original ones; they will be 50 new features, each a linear combination (projection) of the old ones.
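
In R with `prcomp`, that projection is just the first $L$ columns of the score matrix; a minimal sketch, with a random `X` standing in for the 916x225 data:

```r
# Project the data onto the first L = 50 principal components (T_L = X_centered %*% W_L)
set.seed(1)
X <- matrix(rnorm(916 * 225), nrow = 916, ncol = 225)   # placeholder data

pca <- prcomp(X, center = TRUE, scale. = FALSE)
L <- 50

# The scores are already the projected data: take their first L columns ...
T_L <- pca$x[, 1:L]

# ... which is the same as multiplying the centered data by the first L eigenvectors
T_L_manual <- scale(X, center = TRUE, scale = FALSE) %*% pca$rotation[, 1:L]
all.equal(T_L, T_L_manual)   # TRUE up to numerical tolerance
```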

Of course, there's also the fact that you don't know whether your data are well approximated by such a linear (low-dimensional) structure; you might want to check that before relying on PCA.

Ben F
  • Your answer seems to be explaining how to compute PC scores. How does that answer the Q about the association between the variables and the components? – ttnphns Sep 16 '16 at 04:01
  • Thank you for the detail. I understand what PCA is doing and that each PC is a linear combination of the original columns of the data matrix. What I need to do is reduce the original space to about 50 columns in an effort to reproduce the entire space. In other words, I'm looking for 50 columns from the original space which will allow me to reproduce the entire original matrix with the least amount of error. Let's say I had a set of 50 columns and then developed linear models for each of the columns from the original matrix. From that, I would like the 50 columns that give the best predictions – KirkD_CO Sep 16 '16 at 04:16