I'm wondering if there is any relationship among these 3 measures. I can't seem to make a connection among them by referring to the definitions (possibly because I am new to these definitions and am having a bit of a rough time grasping them).

I know the cosine similarity can range from 0 to 1 and the Pearson correlation from -1 to 1, but I'm not sure about the range of the z-score.

What I don't understand is how a given value of cosine similarity could tell you anything about the Pearson correlation or the z-score, and vice versa.

Jud
  • 1
    z score of *what*? z scores of *some* things might be related to Pearson correlation, Z scores of other things may not. For example, if you internally standardize your original variables then the Pearson correlation between x and y is the expected product of their z-scores. Or you might be talking about z-scores *of* Pearson correlations (Pearson correlations minus their expectation under some condition all divided by the standard error of the Pearson correlation), which would certainly be related to the Pearson correlation. – Glen_b Sep 19 '16 at 05:17
  • 1
    Direct relation: https://stats.stackexchange.com/a/22520/3277 – ttnphns Dec 04 '17 at 16:52

1 Answer

The cosine similarity between two vectors $a$ and $b$ is just the cosine of the angle $\theta$ between them $$\cos\theta = \frac{a\cdot b}{\lVert{a}\rVert \, \lVert{b}\rVert}$$ In general it can take any value from $-1$ to $1$. In many applications that use cosine similarity, however, the vectors are non-negative (e.g. a term frequency vector for a document), and in this case the cosine similarity will also be non-negative.
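As a minimal sketch of the formula above, using only the standard library: the function name `cosine_similarity` and the example vectors are made up for illustration and do not come from any particular package.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

# Hypothetical term-frequency vectors: every entry is non-negative,
# so the dot product (and hence the similarity) cannot be negative.
doc1 = [3, 0, 1, 2]
doc2 = [1, 1, 0, 4]
sim_docs = cosine_similarity(doc1, doc2)

# With mixed signs the similarity can be negative; these two vectors
# point in exactly opposite directions.
sim_opposite = cosine_similarity([1, -2, 3], [-1, 2, -3])
```

Here `sim_docs` lands between 0 and 1, while `sim_opposite` is $-1$ up to floating-point rounding.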

For a vector $x$ the "$z$-score" vector would typically be defined as $$z_x=\frac{x-\bar{x}}{s_x}$$ where $\bar{x}=\frac{1}{n}\sum_ix_i$ and $s_x^2=\overline{(x-\bar{x})^2}$ are the mean and variance of $x$ (so $s_x$ is its standard deviation), and an overline denotes the average over the elements of a vector. So $z_x$ has mean 0 and standard deviation 1, i.e. $z_x$ is the standardized version of $x$.
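A small sketch of this standardization, assuming the $1/n$ (population) convention for the variance used in the definition above. The helper name `z_scores` and the data are illustrative only.

```python
import math

def z_scores(x):
    """Standardize x using the population (1/n) standard deviation."""
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n
    std = math.sqrt(var)
    return [(xi - mean) / std for xi in x]

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, variance 4
z = z_scores(x)

# The standardized vector has mean 0 and standard deviation 1.
z_mean = sum(z) / len(z)
z_std = math.sqrt(sum((zi - z_mean) ** 2 for zi in z) / len(z))
```

Note that individual z-scores have no fixed range: they measure distance from the mean in units of standard deviation, so an outlier can have an arbitrarily large $|z|$.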

For two vectors $x$ and $y$, their correlation coefficient would be $$\rho_{x,y}=\overline{(z_xz_y)}$$
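To illustrate, the correlation can be computed literally as the mean of the elementwise product of the two z-score vectors. This is a sketch with hypothetical helper names (`z_scores`, `pearson`) and toy data, again using the $1/n$ convention.

```python
import math

def z_scores(x):
    """Standardize x using the population (1/n) standard deviation."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((xi - mean) ** 2 for xi in x) / n)
    return [(xi - mean) / std for xi in x]

def pearson(x, y):
    """Pearson correlation as the mean of the product of z-scores."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r_xy = pearson(x, y)

# An exact increasing linear function of x has correlation 1.
r_linear = pearson(x, [2.0 * xi + 1.0 for xi in x])
```

For this toy data `r_xy` works out to 0.8 (up to rounding), and `r_linear` to 1.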

Now if the vector $a$ has zero mean, then its variance will be $s_a^2=\frac{1}{n}\lVert{a}\rVert^2$, so its unit vector and z-score will be related by $$\hat{a}=\frac{a}{\lVert{a}\rVert}=\frac{z_a}{\sqrt n}$$
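A quick numerical check of this relation, as a sketch on a made-up zero-mean vector:

```python
import math

a = [-2.0, -1.0, 0.0, 1.0, 2.0]   # illustrative vector with zero mean
n = len(a)

norm_a = math.sqrt(sum(ai * ai for ai in a))
s_a = math.sqrt(sum(ai * ai for ai in a) / n)   # population std, since mean is 0

unit_a = [ai / norm_a for ai in a]              # a / ||a||
z_a = [ai / s_a for ai in a]                    # z-score of a
z_over_sqrt_n = [zi / math.sqrt(n) for zi in z_a]

# Largest elementwise gap between the unit vector and z_a / sqrt(n).
max_diff = max(abs(u - v) for u, v in zip(unit_a, z_over_sqrt_n))

# Equivalently, the z-score vector has norm sqrt(n).
z_norm = math.sqrt(sum(zi * zi for zi in z_a))
```

`max_diff` is zero up to floating-point rounding, and `z_norm` equals $\sqrt n$.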

So if the vectors $a$ and $b$ are centered (i.e. have zero means), then $\cos\theta=\hat{a}\cdot\hat{b}=\frac{1}{n}\,z_a\cdot z_b=\overline{(z_az_b)}=\rho_{a,b}$, i.e. their cosine similarity will be the same as their correlation coefficient.
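The equivalence can be checked numerically. The following is a sketch with hypothetical helper names and toy data: it compares the cosine similarity of the raw vectors, the cosine similarity of the centered vectors, and the Pearson correlation computed from z-scores.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / (math.sqrt(sum(ai * ai for ai in a)) *
                  math.sqrt(sum(bi * bi for bi in b)))

def center(x):
    """Subtract the mean from every element."""
    mean = sum(x) / len(x)
    return [xi - mean for xi in x]

def pearson(x, y):
    """Pearson correlation as the mean product of (1/n-convention) z-scores."""
    n = len(x)
    cx, cy = center(x), center(y)
    sx = math.sqrt(sum(v * v for v in cx) / n)
    sy = math.sqrt(sum(v * v for v in cy) / n)
    return sum((a / sx) * (b / sy) for a, b in zip(cx, cy)) / n

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

raw_cos = cosine_similarity(x, y)                       # uncentered
centered_cos = cosine_similarity(center(x), center(y))  # centered
r = pearson(x, y)
```

For this data `centered_cos` and `r` agree (both 0.8 up to rounding), whereas `raw_cos` is 53/55, roughly 0.96. So a cosine similarity computed on uncentered vectors does not by itself determine the Pearson correlation.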

TL;DR Cosine similarity is a dot product of unit vectors. Pearson correlation is cosine similarity between centered vectors. The "Z-score transform" of a vector is the centered vector scaled to a norm of $\sqrt n$.

GeoMatt22
  • +1. latexnazi comment: `\|` often looks better than `||`, and `\lVert ... \rVert` is the best way to write it. – amoeba Dec 04 '17 at 15:14