Statistically correct to apply Multi-dimensional scaling or PCA to cosine similarity matrix?

Question

Suppose I have a document-term matrix built as in the script below:

library(tidytext)
library(dplyr)
library(tidyr)
library(stylo)
library(ggplot2)

txt1 <- c("this is a sample document about Computer Science", "Not all sample documents are about science, they can be about art too", "and sometimes they are just useless")
names(txt1) <- c("Doc1", "Doc2", "Doc3")
df1 <- data.frame(doc = names(txt1), texts = txt1, stringsAsFactors = FALSE)



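# Build the document-term matrix, z-score each term column,
# then run classical MDS on stylo's cosine distance between documents
a1 <- df1 %>%
  unnest_tokens(output = word, input = texts) %>%
  group_by(doc, word) %>%
  summarise(totals = n()) %>%
  ungroup() %>%
  spread(word, totals, fill = 0) %>%
  select(-doc) %>%
  as.matrix() %>%
  scale(center = TRUE, scale = TRUE) %>%
  dist.cosine() %>%
  cmdscale() %>%
  data.frame()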

a1$name <- 1:nrow(a1)

ggplot(data = a1, aes(x = X1, y = X2)) +
  geom_text(aes(label = name))

Q1. Is it statistically correct to apply multidimensional scaling (in R, via cmdscale) or PCA to a cosine similarity matrix?
I think it is not, since Cosine Similarity is not a proper distance metric.

If the application is incorrect, could you please state why?
If the application is correct, could you again, please, state why?

Q2. Could you also state if z-score transformation is at all required for computing the cosine similarity?

It is perfectly fine to do PCA on cosines. Linear PCA is valid on any [sscp-type](http://stats.stackexchange.com/a/22520/3277) similarity coefficient. PCA on cosines is PCA on unit-scaled (but not centered) variables. As for MDS, it needs distance, not similarity. If you decide on a reasonable, sound way to convert, you may then do MDS, metric or nonmetric, on any distance measure. Torgerson's MDS (PCoA) is more reasonable to do on Euclidean distance. Cosine (or any sscp-similarity) and Euclidean distance [are directly related](http://stats.stackexchange.com/a/36158/3277). — ttnphns, Oct 21 '16 at 20:11
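A minimal sketch of the point about PCA on cosines, using made-up toy data (the matrix `X` and all variable names here are assumptions for illustration, not part of the question). It scales each column to unit length without centering, then checks that the eigendecomposition of the resulting cosine matrix agrees with `prcomp` run on the unit-scaled, uncentered data.

```r
set.seed(1)
# Hypothetical toy data: 6 observations (rows) by 4 variables (columns)
X <- matrix(rpois(24, lambda = 3) + 1, nrow = 6, ncol = 4)

# Scale every column to unit Euclidean length, without centering
Xn <- sweep(X, 2, sqrt(colSums(X^2)), "/")

# Cosine similarity between the columns
C <- crossprod(Xn)

# Route 1: eigendecomposition of the cosine matrix
eig <- eigen(C, symmetric = TRUE)

# Route 2: PCA of the unit-scaled, uncentered data
pca <- prcomp(Xn, center = FALSE, scale. = FALSE)

# prcomp divides by (n - 1), so rescale its variances before comparing
same_values <- isTRUE(all.equal(eig$values, pca$sdev^2 * (nrow(Xn) - 1)))

# Eigenvectors agree up to sign
same_vectors <- isTRUE(all.equal(abs(eig$vectors), abs(unname(pca$rotation))))
```

Both `same_values` and `same_vectors` should come out `TRUE` whenever the eigenvalues are distinct, which is the usual case for real data.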
@ttnphns thank you for the response. Would you like to add this as an answer? Or would you rather I add it on your behalf? — info_seeker, Oct 22 '16 at 06:46
I won't post it as an answer, only because it is too brief and does not lay out the theory. Maybe later. If you are satisfied, treat it as an answer. — ttnphns, Oct 22 '16 at 09:28
@ttnphns I would certainly appreciate some elaboration, but it seems nobody else will be posting an answer, as the question has had few views from the community. — info_seeker, Oct 22 '16 at 10:17
@ttnphns Thanks for the references, great answers that I am still digesting! I am currently applying `cosine similarity` to `z-score` transformed values of my data set `(mean = 0, sd = 1)` (I understand this to be the Pearson correlation), and converting it to a distance using `1-(similarityMatrix)` for MDS. MDS doesn't require a proper distance metric, i.e. one that satisfies the triangle inequality, does it? — info_seeker, Oct 22 '16 at 12:16
`1-(similarityMatrix)` is euclidean d _squared_ (up to a constant factor of 2). Take the root. In general, MDS doesn't require metricity. But metricity is always better, just because it is closer to the euclidean distances plotted on the map! — ttnphns, Oct 22 '16 at 12:38
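
The conversion discussed in these comments can be sketched as follows, under stated assumptions: the toy matrix `X` is invented, and the items to be mapped are stored as columns so that z-scoring applies to each item. The sketch checks that the cosine of z-scored items equals their Pearson correlation, that `sqrt(2 * (1 - S))` reproduces the Euclidean distance between the unit-length items, and then passes that distance to `cmdscale`.

```r
set.seed(42)
# Hypothetical toy data: 5 items to be mapped (columns), 8 features each (rows)
X <- matrix(rnorm(40), nrow = 8, ncol = 5)

# z-score every item (mean 0, sd 1)
Z <- scale(X, center = TRUE, scale = TRUE)

# Cosine similarity between the z-scored items
Zn <- sweep(Z, 2, sqrt(colSums(Z^2)), "/")
S <- crossprod(Zn)

# It coincides with the Pearson correlation of the raw items
cosine_is_pearson <- isTRUE(all.equal(S, cor(X), check.attributes = FALSE))

# sqrt(2 * (1 - S)) is the Euclidean distance between the unit-length items;
# sqrt(1 - S) differs from it only by the constant factor sqrt(2)
D <- sqrt(pmax(2 * (1 - S), 0))  # pmax guards against tiny negative rounding
diag(D) <- 0                     # self-distances are exactly zero
D_direct <- as.matrix(dist(t(Zn)))
is_euclidean <- isTRUE(all.equal(D, D_direct, check.attributes = FALSE))

# Torgerson (classical) MDS on the Euclidean distances
coords <- cmdscale(as.dist(D), k = 2)
```

Dropping the factor of 2 only rescales the map, so `sqrt(1 - S)` gives the same configuration up to a uniform shrinkage; using `1 - S` without the root does not.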
