I am using SVD/PCA for text mining purposes.

Given a normalized term-document matrix $M$ of shape $(|terms|,|documents|)$, applying SVD should let me reduce the dimensionality and keep only the most meaningful dimensions.

By truncating the SVD to 2 components, $U_2$ and $V^T_2$ should contain the 2-dimensional spatial representation of terms and documents, respectively. This should tell me which terms are closer to which documents: [plot]
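A minimal sketch of the truncation described above, using NumPy on a small made-up term-document matrix. The matrix values, the column normalization, and all variable names are illustrative assumptions, not the data behind this question:

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (rows: terms, columns: documents).
M = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 0.0, 2.0, 2.0],
    [1.0, 0.0, 0.0, 3.0],
])

# Scale each document (column) to unit length; this is only one possible
# normalization and is assumed here for illustration.
M = M / np.linalg.norm(M, axis=0, keepdims=True)

# Thin SVD: M = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
U_k = U[:, :k]    # columns are the first k left singular vectors
s_k = s[:k]       # the k largest singular values
Vt_k = Vt[:k, :]  # rows are the first k right singular vectors

term_coords = U_k    # one 2-D point per term, shape (5, 2)
doc_coords = Vt_k.T  # one 2-D point per document, shape (4, 2)

# Rank-2 approximation of M built from the truncated factors.
M_k = U_k @ np.diag(s_k) @ Vt_k
```

Here `term_coords` and `doc_coords` are the unscaled coordinates; whether and how to scale them by the singular values is a separate choice.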

I've seen several examples where only $U$ is visualized, so I'm not sure that my idea of plotting the documents as well is correct. That said, I've also seen that most PCA implementations return $U\cdot\Sigma$, which makes me wonder:

  1. Is this idea correct?
  2. Should I perform a dot product with $\Sigma$ on $U$ and/or $V^T$?
  3. Why are some documents so distant from the words, given that each document surely contains at least one of them?
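To make question 2 concrete, here is a sketch of the scalings involved, assuming a random placeholder matrix instead of real data (all names are illustrative). It only shows what each choice computes; it does not decide which one suits a given plot:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 4))  # placeholder terms x documents matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
U_k = U[:, :k]
S_k = np.diag(s[:k])
V_k = Vt[:k, :].T

# Option A: plot the unscaled singular vectors for both terms and documents.
terms_a, docs_a = U_k, V_k

# Option B: scale the terms by the singular values.
# This equals the projection of the rows of M onto V_k.
terms_b = U_k @ S_k

# Option C: scale the documents by the singular values.
# This equals the projection of the columns of M onto U_k.
docs_c = V_k @ S_k

# Inner products between scaled terms (B) and unscaled documents (A)
# reproduce the rank-k approximation of M.
M_k = terms_b @ docs_a.T
```

PCA routines that return $U\cdot\Sigma$ usually work on a centered samples-by-features matrix, where $U\cdot\Sigma$ is the data projected onto the principal axes; this corresponds to option B with the rows treated as samples.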
Vektor88
  • I take it that by $\Sigma$ you mean the singular values. It looks to me that your question is about ways to draw a **biplot**. On an SVD-based biplot, you can show only $V$, only $U$, or both, and with various normalizations. All of these are valid, but they convey different nuances of information. – ttnphns Aug 13 '15 at 13:25
  • _If_ you are new to biplots, it might be hard at first to grasp the idea. I would then recommend studying the Q/A on this site tagged `biplot`. Among what you can find, my own sufficiently detailed and dense answer with pictures is [here](http://stats.stackexchange.com/q/141754/3277) (start with the pictures to get oriented). – ttnphns Aug 13 '15 at 13:26
