Cosine similarity is not a measure of (the strength of) linear association the way Pearson $r$ is; it is a measure of proportional association, which is a narrower notion. The difference lies in centering: $r$ is the cosine of the centered data.
Cosine similarity is a measure of proportionality: if the points of a bivariate data cloud lie on a straight line passing through the coordinate origin, cosine similarity is maximal, $cos_{xy}=1$. If that straight line does not pass through the origin, or if the points deviate from a straight line, $cos_{xy}$ gets smaller. Because Pearson $r$ is the $cos$ of the cloud centered on both axes, after centering a straight line of points always pierces the origin, so for $r$ only deviations from lying on a straight line can lower the coefficient: correlation is the extent of linearity. When $cos$ is $1$, $r$ is also $1$ and full linearity is observed; but if $r$ is $1$, $cos$ is not necessarily $1$: full linearity is not enough for $cos$ to be maximal. $Cos$ is anchored to an "external" point, the origin; $r$ is anchored only to the data cloud itself, as represented by its mean.
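A quick illustrative sketch of the above in Python with NumPy (not from the original answer; the data are arbitrary):

```python
import numpy as np

def cosine(x, y):
    # scalar product normalized by the vectors' norms
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 50)

y = 2 * x                             # straight line through the origin
print(cosine(x, y))                   # ~1: perfect proportionality
print(np.corrcoef(x, y)[0, 1])        # ~1: perfect linearity

y_shift = 2 * x + 5                   # same line, shifted off the origin
print(cosine(x, y_shift))             # < 1: proportionality is broken
print(np.corrcoef(x, y_shift)[0, 1])  # still ~1: linearity is intact

# r is the cosine of the centered cloud:
print(cosine(x - x.mean(), y_shift - y_shift.mean()))  # ~1 again
```

Shifting the line off the origin lowers $cos$ but leaves $r$ untouched, and centering restores the equality $r = cos$.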
From a regression standpoint, both $r$ and $cos$ are $R_{regr}=\sqrt{1-SS_{resid}/SS_{tot}}$, but $cos$ corresponds to regression without an intercept, i.e. with the regression line forced through the origin, and with $SS_{tot}$ computed as squared deviations from $Y=0$, not from $Y=\bar{Y}$.
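This correspondence can be checked numerically; a minimal sketch, assuming ordinary least squares through the origin on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3, 1, 40)
y = 1.5 * x + rng.normal(0, 0.5, 40)

# No-intercept regression: slope that minimizes sum((y - b*x)^2)
b = (x @ y) / (x @ x)
ss_resid = np.sum((y - b * x) ** 2)
ss_tot = np.sum(y ** 2)        # deviations from Y = 0, not from the mean
R = np.sqrt(1 - ss_resid / ss_tot)

cos_xy = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(R, cos_xy)               # identical up to sign
```

Algebraically, $1 - SS_{resid}/SS_{tot} = (\sum xy)^2 / (\sum x^2 \sum y^2)$, which is exactly $cos^2_{xy}$.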
$Cos$ and $r$ are, respectively, the scalar product and the covariance, normalized so that the coefficient's sensitivity to the variables' scale or magnitude is removed.
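A small sketch of that normalization (illustrative data, not from the original answer):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = rng.normal(size=30)
n = len(x)

# Covariance is the (averaged) scalar product of the centered variables
xc, yc = x - x.mean(), y - y.mean()
cov = xc @ yc / (n - 1)

# Dividing out each variable's magnitude removes the scale sensitivity:
cos_xy = x @ y / np.sqrt((x @ x) * (y @ y))   # scalar product -> cosine
r = cov / (x.std(ddof=1) * y.std(ddof=1))     # covariance -> Pearson r

print(np.isclose(cov, np.cov(x, y)[0, 1]))         # True
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))      # True
# Rescaling a variable changes cov but not r (nor cos):
print(np.isclose(np.corrcoef(3 * x, y)[0, 1], r))  # True
```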
So cosine similarity and Pearson $r$ should not be conflated when asking what each measures, just as covariance and Pearson $r$ should not.
As for distance correlation, the idea behind it is different from both cosine and $r$. It captures the notion of generalized association (linear, nonlinear, curvilinear), seen from the viewpoint of stochastic independence. For a bivariate normal population, zero Pearson $r$ implies stochastic independence. Distance correlation generalizes this to any distribution, and it does not center the data to its mean (because in the "double centering" operation, Euclidean distances are taken, not squared ones).
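A sketch of the standard (biased) sample distance correlation, using Euclidean distance matrices that are double-centered, on made-up data where Pearson $r$ misses a clear dependence:

```python
import numpy as np

def dcorr(x, y):
    # Pairwise Euclidean distances (not squared), then double-center each matrix
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()        # squared sample distance covariance
    dvarx = (A * A).mean()
    dvary = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvarx * dvary))

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 200)
y_quad = x ** 2                       # nonlinear, fully dependent on x
print(np.corrcoef(x, y_quad)[0, 1])   # close to 0: Pearson r misses it
print(dcorr(x, y_quad))               # clearly positive: dependence detected
```

For an exactly linear relation such as $y = 2x + 1$ the coefficient reaches $1$, while for the quadratic cloud above, Pearson $r$ is near zero but distance correlation is not.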