
I have a matrix of users, each with their page-view counts across 50 pages, so each data point has 50 dimensions.

What I want to find is: which combination of pages explains the user data the most?

I did PCA and got that the first component explains 80% of the data's variance.

But I cannot figure out how to find which dimensions contribute the most to that component, i.e. their weights in the linear combination.

Since a PCA component is just a linear combination of the individual dimensions, I should be able to do this somehow.

Is my approach wrong, or is there a better-suited method to extract this information?

Thanks for your help.

Rafael
  • Look at [this](https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another) and [this](https://stats.stackexchange.com/questions/92499/how-to-interpret-pca-loadings) excellent discussion. – Krrr Aug 29 '17 at 11:19

1 Answer


You're looking for the eigenvector that corresponds to the first (largest) singular value, i.e. the first principal direction.

The problem with retrieving the indices of its nonzero entries will most likely be that most or all of the entries are nonzero. That is because of how PCA works: it finds a rotation of the feature space such that the new coordinates explain the variance best along orthogonal directions.
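
For illustration, a minimal sketch with scikit-learn (the array `X` of page-view counts is a random placeholder, not your data): it fits PCA and ranks the original dimensions by the magnitude of their weight in the first component. You will typically find that almost every weight is nonzero.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the user-by-page matrix: 200 users x 50 pages of view counts.
rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(200, 50)).astype(float)

pca = PCA(n_components=1).fit(X)

weights = pca.components_[0]           # weights of the 50 pages in the first component
print(pca.explained_variance_ratio_)   # fraction of variance explained by that component

# Rank pages by the magnitude of their weight in the first principal direction.
top_pages = np.argsort(np.abs(weights))[::-1]
print(top_pages[:10])                  # indices of the 10 most influential pages
```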

Sparse PCA is used to retrieve sparse directions that explain the data well, subject to a sparsity constraint on the component weights.

Sparse PCA is available, for example, in scikit-learn. You can also try H2O's GLRM, which is available in both R and Python.
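
A sketch of the scikit-learn route (again with placeholder data); the `alpha` penalty controls how many of the component's entries are driven to exactly zero:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

# Placeholder user-by-page count matrix.
rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(200, 50)).astype(float)

# Larger alpha -> stronger L1 penalty -> sparser components.
spca = SparsePCA(n_components=1, alpha=1.0, random_state=0).fit(X)

weights = spca.components_[0]
nonzero_pages = np.nonzero(weights)[0]
print(nonzero_pages)   # only the pages that keep a nonzero weight
```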

Jakub Bartczuk
  • Thanks for the answer. PCA will use all 50 dimensions, but I can retrieve the weights and set some threshold for what is more important, right? I will try GLRM (is it a family of models? any specific algorithm?) and Sparse PCA. Any other method to explore that could get me the desired info? Thanks again :) – Rafael Aug 29 '17 at 06:49
  • 1
  • You didn't specify what type of data you have, and GLRMs can model both continuous and categorical data. If you're familiar with the Eckart-Young theorem (or the characterization of PCA as the closest rank-k matrix in Frobenius norm), you can look up a notebook I made on a similar topic (a direct generalization, actually, for situations where only some matrix entries are given) [here](https://github.com/lambdaofgod/matrix-factorization/blob/master/notebooks/h2o%20GLRM.ipynb). – Jakub Bartczuk Aug 29 '17 at 07:32
  • That is the continuous-data version (don't mind the input data being discrete; for this kind of GLRM you need to assume continuity for it to make sense). – Jakub Bartczuk Aug 29 '17 at 07:34
  • @JakubBartczuk I think looking at the eigenvector itself is not enough, as it ignores the eigenvalues. See the links I posted as a comment to the question. – Krrr Aug 29 '17 at 11:22
  • @DataD'oh OP only mentioned directions. I don't see what the purpose of using eigenvalues is, since they only scale the directions. Do you mean that they should guide how many eigenvectors to use? – Jakub Bartczuk Aug 29 '17 at 11:38
  • @JakubBartczuk OP asks "...how do I get which dimensions [I read features] contribute **the most** to that component?" and from one of the discussions I linked "... loading matrix is informative: its vertical sums of squares are the eigenvalues, components' variances, and **its horizontal sums of squares are portions of the variables' variances being "explained" by the components**.". – Krrr Aug 29 '17 at 13:31
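
A minimal sketch of the loadings computation described in that quote, assuming scikit-learn's PCA and a placeholder data matrix (loadings are the eigenvectors scaled by the square roots of the eigenvalues):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder user-by-page count matrix.
rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(200, 50)).astype(float)

pca = PCA().fit(X)   # keep all components

# Loadings: eigenvectors scaled by sqrt of the eigenvalues (component variances).
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Vertical (column) sums of squares recover the eigenvalues ...
print(np.allclose((loadings ** 2).sum(axis=0), pca.explained_variance_))  # True

# ... and horizontal (row) sums of squares give the portion of each page's
# variance explained by the retained components.
variance_explained_per_page = (loadings ** 2).sum(axis=1)
print(variance_explained_per_page[:5])
```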