Distances in PCA space

Question

I'm working on a project involving PCA, and I know the method reasonably well. My work involves finding the nearest neighbors (those with the smallest Euclidean distance) of a particular spectrum in a database. I reduce the dimensionality of this database using PCA by projecting all the spectra onto the PCA space. Then I find the spectrum's closest neighbors using the projected coefficients.

When I visualize PCA in a 2D space, I can think of examples where a small distance in the PCA space does not correspond at all to a small distance in the original space. If the original space is 3D and the PCA space is 2D, for example, data points lie "above" and "below" the PCA space (a 2D plane). So data points might have similar projections but be far from each other in the original space. (Please correct me if I'm wrong.)

My question is: is there a way to quantify this idea in order to achieve more accurate nearest neighbor search? And is there a way to represent the distance between the original and projected data-points in the PCA space (knowing that this distance is always orthogonal to the space)? P.S. I'm not a mathematician and I apologize for any incorrect terms.

Look at [Bottom to top explanation of the Mahalanobis distance?](http://stats.stackexchange.com/questions/62092/bottom-to-top-explanation-of-the-mahalanobis-distance) - it's that. — Piotr Migdal, Mar 10 '16 at 16:53
You seem to be talking about preserving distances in a reduced space; that means you are speaking about multidimensional scaling (MDS). PCA can be seen as the simplest form of MDS, but it isn't "true" MDS, which is iterative. Please see [this](https://stats.stackexchange.com/a/14017/3277) and the whole thread there. If you want a really proper nearest-neighbor search in a low-dimensional space, you ought to consider iterative MDS instead of PCA. — ttnphns, Sep 02 '17 at 09:50
If you need nearest neighbours, then why do you need to do PCA at all? The whole setting is not clear to me, so I vote to close as unclear. — amoeba, Aug 22 '18 at 19:48

score 5 · Answer 1 · edited Sep 02 '17 at 09:29

Bit late, but here we go:

The transformation spectra -> PC scores is typically set up to be a pure rotation. Thus Euclidean distance in PC score space equals Euclidean distance in original space as long as no PCs are discarded. Thus, neighbours stay neighbours.
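The rotation argument can be checked numerically. The following is a minimal sketch on synthetic data (the random matrix `X` and the helper `pairwise_dist` are illustrative assumptions, not part of the original question): with all PCs kept, pairwise distances match the original ones, and after truncation they can only shrink.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))          # 50 synthetic "spectra" with 6 channels
Xc = X - X.mean(axis=0)               # centering, as PCA does

# PCA via SVD: the rows of Vt are the principal directions (loadings)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                    # all 6 PCs kept, so this is a rotation

def pairwise_dist(A):
    """Matrix of Euclidean distances between all rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

d_orig = pairwise_dist(X)             # distances in the original space
d_full = pairwise_dist(scores)        # distances using all PC scores
d_2pc = pairwise_dist(scores[:, :2])  # distances using only the first 2 PCs

full_rotation_preserves = np.allclose(d_orig, d_full)
truncation_never_inflates = bool(np.all(d_2pc <= d_orig + 1e-9))
```

Because dropping PCs removes non-negative terms from the squared distance, the truncated distance is a lower bound on the true one: points that are far apart in the reduced space are also far apart in the original space, but not the other way round.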

For models that keep only some of the PCs, you can maybe construct a (squared) distance that distinguishes distance modeled from distance orthogonal to the model. This is e.g. done in SIMCA.
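As a rough sketch of that split (in the spirit of a score distance plus an orthogonal distance, not SIMCA's exact statistics), the squared Euclidean distance between two points separates into a part inside the model plane and a part orthogonal to it. The data, `k = 2`, and the helper `split_sq_dist` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data whose variance decays from channel to channel
X = rng.normal(size=(40, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0)

k = 2                                  # number of PCs kept (assumed)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:k].T                           # loadings, shape (5, k)

T = Xc @ P                             # scores: coordinates inside the model
E = Xc - T @ P.T                       # residuals: part orthogonal to the model

# Squared distance of each point from its own projection onto the model plane
orth_sq_per_point = np.sum(E ** 2, axis=1)

def split_sq_dist(i, j):
    """Squared distance between points i and j, split into two parts."""
    d_model = np.sum((T[i] - T[j]) ** 2)   # distance seen by the PCA model
    d_orth = np.sum((E[i] - E[j]) ** 2)    # distance hidden from the model
    return d_model, d_orth

d_model, d_orth = split_sq_dist(0, 1)
d_total = np.sum((X[0] - X[1]) ** 2)       # full squared distance
```

The two parts add up to the full squared distance, so a large `d_orth` flags pairs that look close in the score space but are not close in the original space.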

score 0 · Answer 2 · answered May 03 '15 at 21:53

It sounds like you want to know 1) how to get from the PCA projection back to the original data, and 2) what to do about nearest neighbors. For 1), look at the PCA score coefficient matrix, which allows a back-projection. For 2), nearest-neighbor work after PCA commonly uses a doubly-centered Gram matrix ($G = XX^T$). Hence, you probably need to work with the Gram matrix, which is used heavily in distance metric (non-linear manifold) learning.
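A small sketch of both points, assuming synthetic data and `k = 2` retained PCs (all names here are illustrative): the Gram matrix of the column-centered data is doubly centered and yields all squared Euclidean distances, and the loadings give a back-projection whose error equals the discarded variance.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
mean = X.mean(axis=0)
Xc = X - mean                              # column-centered data

# Gram matrix of the centered data; its rows and columns sum to zero
G = Xc @ Xc.T
g = np.diag(G)
D2 = g[:, None] + g[None, :] - 2 * G       # squared distances from G alone

# Back-projection from k scores to the original space
k = 2
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:k].T                               # loadings, shape (4, k)
X_hat = (Xc @ P) @ P.T + mean              # rank-k reconstruction of X

recon_err = np.sum((X - X_hat) ** 2)       # total squared reconstruction error
```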

No to 1): I need a way to distinguish data points that are actually neighbors from those that are not. — Wfarah, May 03 '15 at 22:03

score 0 · Answer 3 · answered Aug 22 '18 at 17:04

I'm working with PCA coefficients now. You're probably done with your project, but I think this might be helpful to others. In PCA, the higher-order components carry less variance, so discarding them does not lose as much information. The distances between points are not preserved exactly, but the ordering of distances tends to stay the same, because on average each dimension you truncate contributes less distance than any dimension before it.
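How well the neighbor ordering survives truncation can be checked on the data at hand. Below is a minimal sketch on synthetic data (the sizes, the 3 retained PCs, and the shortlist length are arbitrary assumptions): it compares the nearest neighbors found in the reduced space with the true ones, and also tries a two-stage search that shortlists candidates in the reduced space and re-ranks them by full distance.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data with variance decaying across directions
X = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:3].T                     # keep the first 3 PCs

query = 0
d_full = np.linalg.norm(Xc - Xc[query], axis=1)         # true distances
d_red = np.linalg.norm(scores - scores[query], axis=1)  # reduced-space distances

n_neighbors, n_candidates = 5, 30
true_nn = np.argsort(d_full)[1:n_neighbors + 1]         # index 0 is the query
reduced_nn = np.argsort(d_red)[1:n_neighbors + 1]

# Two-stage search: shortlist in the reduced space, re-rank by full distance
candidates = np.argsort(d_red)[1:n_candidates + 1]
reranked = candidates[np.argsort(d_full[candidates])][:n_neighbors]

overlap_reduced = len(set(true_nn) & set(reduced_nn))   # hits without re-ranking
overlap_reranked = len(set(true_nn) & set(reranked))    # hits after re-ranking
```

Since the reduced-space distance never exceeds the full distance, re-ranking a shortlist can only recover more of the true neighbors, at the cost of computing full distances for the shortlisted points.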
