Visualizing multi-dimensional data (LSI) in 2D

Question

I'm using latent semantic indexing to find similarities between documents (thanks, JMS!).

After dimension reduction, I've tried k-means clustering to group the documents into clusters, which works very well. But I'd like to go a bit further, and visualize the documents as a set of nodes, where the distance between any two nodes is inversely proportional to their similarity (nodes that are highly similar are close together).

It strikes me that I can't accurately reduce a similarity matrix to a 2-dimensional graph, since my data has more than 2 dimensions. So my first question: is there a standard way to do this?

Could I just reduce my data to two dimensions and then plot them as the X and Y axis, and would that suffice for a group of ~100-200 documents? If this is the solution, is it better to reduce my data to 2 dimensions from the start, or is there any way to pick the two "best" dimensions from my multi-dimensional data?

I am using Python and the gensim library if that makes a difference.

Why do you need to reduce dimensionality? To construct the graph you want, you only need edges where the length of an edge is proportional to the distance between documents. You have that already, from the metric used for your k-means clustering. — Aman, Feb 05 '13 at 17:25
@Aman That does not work for displaying similarity between more than 2 documents on a 2D plane (graph). Sure, I can plot points A and B with a separation based on the k-means distance. But when I need to plot point C based on its distances to A and B, typically there is no point in 2D space that satisfies all pairwise relationships. — Jeff, Feb 05 '13 at 21:22

score 7 · Accepted Answer · answered Jun 07 '11 at 04:15

This is what MDS (multidimensional scaling) is designed for. In short, if you're given a similarity matrix $M$, you want to find the closest approximation $S = X X^\top$ where $S$ has rank 2. This can be done by computing the eigendecomposition $M = V \Lambda V^\top$ (which coincides with the SVD when $M$ is symmetric and positive semidefinite), so that $M = X X^\top$ with $X = V \Lambda^{1/2}$.

Now, assuming that $\Lambda$ is permuted so the eigenvalues are in decreasing order, the first two columns of $X$ are your desired embedding in the plane.
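
As a rough illustration of the recipe above, here is a minimal NumPy sketch. It assumes $M$ is a symmetric similarity matrix that is at least approximately positive semidefinite (for example, cosine similarities between LSI vectors); the function name `embed_2d` and the toy data are hypothetical, made up for this example, and negative eigenvalues are simply clipped to zero.

```python
import numpy as np


def embed_2d(M):
    """Return an (n, 2) embedding X such that X @ X.T approximates M.

    Assumes M is a symmetric, roughly positive semidefinite similarity
    matrix. This helper is illustrative, not part of any library.
    """
    M = np.asarray(M, dtype=float)
    M = (M + M.T) / 2.0                      # remove tiny asymmetries from round-off

    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(M)

    # Indices of the two largest eigenvalues, in decreasing order.
    order = np.argsort(eigvals)[::-1][:2]

    # Clip negatives so the square root is defined if M is not exactly PSD.
    top_vals = np.clip(eigvals[order], 0.0, None)

    # First two columns of X = V * Lambda^(1/2).
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy stand-in for document vectors: 6 "documents" in 5 dimensions,
# normalized so that M holds cosine similarities.
rng = np.random.default_rng(0)
docs = rng.normal(size=(6, 5))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
M = docs @ docs.T

coords = embed_2d(M)   # one (x, y) row per document, ready to scatter-plot
```

Each row of `coords` can be passed directly to a scatter plot. How faithful the picture is depends on how much of the spectrum the two largest eigenvalues account for, so comparing them with the remaining eigenvalues gives a rough sense of the distortion.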

There's lots of code available for MDS (and I'd be surprised if scipy doesn't have some version of it). In any case as long as you have access to some SVD routine in python you're set.

I think LDA would be better for this. PCA, which is what you get through the SVD, would not preserve any cluster (class) discriminatory information, which is what the OP is after. — Zhubarb, Apr 09 '15 at 08:13

score 0 · Answer 2 · answered Dec 20 '11 at 16:20

There is a piece of software called ggobi that can help you. It lets you explore multi-dimensional pseudo-spaces. It's mostly for data exploration but its interface is extremely friendly and 'it-just-works'!

You just need a CSV format (in R I usually just use write.csv with the default parameters) or an XML file (this format allows you more control; I usually save my table in CSV then export it to XML with ggobi and edit it manually for instance to change the order of some factors).
